Abstract

Posterior probabilistic statistical inference without priors is an important but so 
far elusive goal. Fisher's fiducial inference, Dempster-Shafer theory of belief func- 
tions, and Bayesian inference with default priors are attempts to achieve this goal 
but, to date, none has given a completely satisfactory picture. This paper presents a 
new framework for probabilistic inference, based on inferential models (IMs), which 
not only provides data-dependent probabilistic measures of uncertainty about the 
unknown parameter, but does so with an automatic long-run frequency calibration 
property. The key to this new approach is the identification of an unobservable 
auxiliary variable associated with observable data and unknown parameter, and the 
prediction of this auxiliary variable with a random set before conditioning on data. 
Here we present a three-step IM construction, and prove a frequency-calibration 
property of the IM's belief function under mild conditions. A corresponding optimality
theory is developed, which helps to resolve the non-uniqueness issue. Several
examples are presented to illustrate this new approach. 

Keywords and phrases: Belief function; plausibility function; predictive random 
set; score function; validity 



1 Introduction 

In a statistical inference problem, one attempts to convert experience, in the form of 
observed data, to knowledge about the unknown parameter of interest. The fact that ob- 
served data is surely limited implies that there will be some uncertainty in this conversion, 
and probability is a natural tool to describe this uncertainty. But a statistical inference
problem is different from the classical probability setting because everything (observed
data and unknown parameter) is fixed, and so it is unclear where these probabilistic
assessments of uncertainty should come from, and how they should be interpreted. For
example, the classical frequentist approach assigns probabilistic assessments of uncertainty
(e.g., confidence levels) by considering repeated sampling from the super-population of
possible data sets. These uncertainty measures do not depend on the observed data,
so their meaningfulness in a given problem is questionable. The Bayesian approach, on
the other hand, is able to produce meaningful data-dependent probabilistic measures of
uncertainty, but the cost is that a prior probability distribution for the unknown
parameter is required. Early efforts to get probabilistic inference without prior specification
include Fisher's fiducial inference (Zabell 1992) and its variants (Hannig 2009, 2012;
Hannig and Lee 2009), confidence distributions (Xie and Singh 2012; Xie et al. 2011),
Fraser's structural inference (Fraser 1968), and the Dempster-Shafer theory (Dempster
2008; Shafer 1976). These methods generate probabilities for inference, but these
probabilities may not be easy to interpret, e.g., they may not be properly calibrated across
users or experiments. So recent efforts have focused on incorporating a frequentist
element. In particular, objective Bayes analysis with default/reference priors (Berger 2006;
Berger et al. 2009; Bernardo 1979; Ghosh 2011) attempts to construct priors for which
certain posterior inferences, such as credible intervals, closely match those of a
frequentist (Fraser 2011; Fraser et al. 2010). Calibrated Bayes (Dawid 1985; Little 2011;
Rubin 1984) has similar motivations. But difficulties remain in choosing good reference priors
for high-dimensional problems so, despite these efforts, a fully satisfactory framework of
objective Bayes inference has yet to emerge.

The goal of this paper is to develop a new framework for statistical inference, called
inferential models (IMs). The seeds for this idea were first planted in Martin et al. (2010)
and Zhang and Liu (2011); here we formalize and extend these ideas towards a cohesive
framework for statistical inference. The jumping off point is a simple association of
the observable data $X$ and unknown parameter $\theta \in \Theta$ with an unobservable auxiliary
variable $U$. For example, consider the simple signal plus noise model, $X = \theta + U$, where
$U \sim N(0, 1)$. If $X = x$ is observed, then we know that $x = \theta + u^\star$, where $u^\star$ is some
unobserved realization of $U$. From this it is clear that knowing $u^\star$ is equivalent to knowing
$\theta$. So the IM approach attempts to accurately predict the value $u^\star$ before conditioning on
$X = x$. The benefit of focusing on $u^\star$ rather than $\theta$ is that more information is available
about $u^\star$: indeed, all that is known about $\theta$ is that it sits in $\Theta$, while $u^\star$ is known to be
a realization of a draw $U$ from an a priori distribution, in this case $N(0, 1)$, that is fully
specified by the postulated sampling model. However, this a priori distribution alone
is insufficient for accurate prediction of $u^\star$. Therefore, we adopt a so-called predictive
random set for predicting $u^\star$, which amounts to a sort of "smearing" of this distribution
for $U$. When combined with the association between observed data, parameters, and
auxiliary variables, these random sets produce prior-free, data-dependent probabilistic
assessments of uncertainty about $\theta$.

To summarize, an IM starts with an association between data, parameters, and auxiliary
variables and a predictive random set, and produces prior-free, post-data probabilistic
measures of uncertainty about the unknown parameter. The following
associate-predict-combine steps provide a simple yet formal IM construction. The details
of each of these three steps will be fleshed out in Section 2.

A-step. Associate the unknown parameter $\theta$ to each possible $(x, u)$ pair to obtain a
collection of sets $\Theta_x(u)$ of candidate parameter values.

P-step. Predict $u^\star$ with a valid predictive random set $\mathcal{S}$.

C-step. Combine $X = x$, $\Theta_x(u)$, and $\mathcal{S}$ to obtain a random set
$\Theta_x(\mathcal{S}) = \bigcup_{u \in \mathcal{S}} \Theta_x(u)$.
Then, for any assertion $A \subseteq \Theta$, compute the probability that the random set
$\Theta_x(\mathcal{S})$ is a subset of $A$ as a measure of the available evidence in $x$ supporting $A$.

The A-step is meant to emphasize the use of unobservable but predictable auxiliary
variables in the statistical modeling step. These auxiliary variables make it possible to
introduce posterior probability-like quantities without a prior distribution for $\theta$. The
P-step is new and unique to the inferential model framework. The key is that $\Theta_x(\mathcal{S})$
contains the true $\theta$ if and only if $\mathcal{S}$ contains $u^\star$. Then the validity condition in the P-step
ensures that $\mathcal{S}$ will hit its target with large probability which, in turn, guarantees that
probabilistic output from the C-step has a desirable frequency-calibration property. This,
together with its dependence on the observed data $x$, makes the IM's probabilistic output
meaningful both within and across experiments.

The remainder of the paper is organized as follows. Section 2 provides the details of the
IM analysis, specifically the three-step construction outlined above, as well as a
description of calculation and interpretation of the IM output: a posterior belief function.
These ideas are illustrated with a simple Poisson mean example. After arguing, in Section 2,
that the IM output provides a meaningful summary of one's uncertainty about $\theta$ after
seeing $X = x$, we prove a frequency-calibration property of the posterior belief functions
in Section 3, which establishes the meaningfulness of the posterior belief function across
different users and experiments. As a consequence of this frequency-calibration property,
we show in Section 3.4 that the IM output can easily be used to design new frequentist
decision procedures having the desired control on error probabilities, etc. Some basic but
fundamental results on IM optimality are presented in Section 4. Section 5 gives IM-based
solutions to two non-trivial examples, both involving some sort of marginalization.
Nonetheless, these examples are relatively simple and they illustrate the advantages of the
IM approach. Concluding remarks are given in Section 6, and R code for the examples
is available on the first author's website: www.math.uic.edu/~rgmartin



2 Inferential models 

2.1 Auxiliary variable associations 

If $X$ denotes the observable sample data, then the sampling model is a probability
distribution $P_{X|\theta}$ on the sample space $\mathbb{X}$, indexed by a parameter $\theta \in \Theta$. Here $X$ may consist
of a collection of $n$ (possibly vector-valued) data points, in which case both $P_{X|\theta}$ and $\mathbb{X}$
would depend on $n$. The sampling model for $X$ is induced by an auxiliary variable $U$, for
given $\theta$. Let $\mathbb{U}$ be an (arbitrary) auxiliary space, equipped with a probability measure
$P_U$. In applications, $\mathbb{U}$ can often be a unit hyper-cube and $P_U$ Lebesgue measure. The
sampling model $P_{X|\theta}$ shall be determined by the following "algorithm:"

$$ \text{sample } U \sim P_U \text{ and set } X = a(U, \theta), \qquad (2.1) $$

for an appropriate mapping $a : \mathbb{U} \times \Theta \to \mathbb{X}$. The key is the association of the
observable $X$, the unknown $\theta$, and the auxiliary variable $U$ through the relation $X = a(U, \theta)$.
This particular formulation of the sampling model is not really a restriction. In fact,
the two-step construction of the observable $X$ in (2.1) is often consistent with scientific
understanding of the underlying process under investigation; linear models form an
interesting class of examples. As another example, suppose $X = (X_1, \ldots, X_n)$ consists of
an independent sample from a continuous distribution. If the corresponding distribution
function $F_\theta$ is invertible, then $a(U, \theta)$ may be written as

$$ a(U, \theta) = \bigl(F_\theta^{-1}(U_1), \ldots, F_\theta^{-1}(U_n)\bigr), \qquad (2.2) $$

where $U = (U_1, \ldots, U_n)$ is a set of independent $\mathrm{Unif}(0, 1)$ random variables.

The notation $X = a(U, \theta)$ chosen to represent the association between $(X, \theta, U)$ is
just for simplicity. In fact, this association need not be described by a formal equation.
As the Poisson example below shows, all we need is a recipe, like that in (2.1), describing
how to produce a sample $X$, for a given $\theta$, based on a realization $U \sim P_U$.
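The two-step recipe in (2.1) and the inverse-distribution-function association in (2.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential sampling model with rate $\theta$ is not used in the paper and is chosen here because its inverse distribution function has a closed form, and the function names are hypothetical.

```python
import math
import random

# Illustrative assumption: an Exp(rate=theta) sampling model, which does not
# appear in the paper. All function names are hypothetical.

def inv_cdf_exponential(u, theta):
    """Inverse distribution function of Exp(rate=theta): solves F_theta(x) = u."""
    return -math.log(1.0 - u) / theta

def associate(u_vec, theta):
    """Association a(u, theta) as in (2.2): apply F_theta^{-1} coordinate-wise."""
    return [inv_cdf_exponential(u, theta) for u in u_vec]

def sample(n, theta, rng):
    """Two-step recipe (2.1): draw U ~ Unif(0,1)^n, then set X = a(U, theta)."""
    u_vec = [rng.random() for _ in range(n)]
    return associate(u_vec, theta)

rng = random.Random(0)
xs = sample(50_000, theta=2.0, rng=rng)
sample_mean = sum(xs) / len(xs)  # should be close to 1 / theta
```

The same `associate` function serves both roles described in the text: before the experiment it generates data from $U$, and after observing $x$ it is the relation that links $x$, $\theta$, and the unobserved $u^\star$.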

Gaussian Example. Consider the problem of inference on the mean $\theta$ based on a single
sample $X \sim N(\theta, 1)$. In this case, the association linking $X$, $\theta$, and an auxiliary variable $U$
may be written as $U = \Phi(X - \theta)$ or, equivalently, $X = \theta + \Phi^{-1}(U)$, where $U \sim \mathrm{Unif}(0, 1)$,
and $\Phi$ is the standard Gaussian distribution function.

Poisson Example. Consider the problem of inference on the mean $\theta$ of a Poisson
population based on a single observation $X$. For this discrete problem, the association
for $X$, given $\theta$, may be written as

$$ F_\theta(X - 1) < 1 - U \le F_\theta(X), \quad U \sim \mathrm{Unif}(0, 1), \qquad (2.3) $$

where $F_\theta$ denotes the $\mathrm{Pois}(\theta)$ distribution function. This representation is familiar for
simulating $X \sim \mathrm{Pois}(\theta)$, i.e., one can first sample $U \sim \mathrm{Unif}(0, 1)$ and then choose $X$ so
that the inequalities in (2.3) are satisfied. But here we also interpret (2.3) as a means to
link data, parameter, and auxiliary variable.
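As a rough illustration of how the inequalities in (2.3) act as a sampling recipe, the sketch below picks, for a given $u$ and $\theta$, the smallest integer $x$ whose Poisson distribution function value reaches $1 - u$. It assumes the convention $F_\theta(x-1) < 1-u \le F_\theta(x)$ for which side of (2.3) is strict; the function names are hypothetical.

```python
import math
import random

def poisson_cdf(x, theta):
    """F_theta(x) = P(X <= x) for X ~ Pois(theta); equals 0 for x < 0."""
    if x < 0:
        return 0.0
    term = math.exp(-theta)  # P(X = 0)
    total = term
    for k in range(1, x + 1):
        term *= theta / k
        total += term
    return total

def associate_poisson(u, theta):
    """Smallest integer x with 1 - u <= F_theta(x), so that (2.3) holds.

    Assumes the strict inequality in (2.3) sits on the left.
    """
    target = 1.0 - u
    x = 0
    term = math.exp(-theta)
    cdf = term
    # Stop once the cdf reaches the target; the `term > 0` guard ends the loop
    # if floating-point underflow makes further progress impossible.
    while cdf < target and term > 0.0:
        x += 1
        term *= theta / x
        cdf += term
    return x

rng = random.Random(0)
draws = [associate_poisson(rng.random(), 3.0) for _ in range(20_000)]
draw_mean = sum(draws) / len(draws)  # should be close to theta = 3
```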

It should not be surprising that, in general, there are many associations for a given
sampling model. In fact, for a given sampling model $P_{X|\theta}$, there are as many associations
as there are triplets $(\mathbb{U}, P_U, a)$ such that $P_{X|\theta}$ equals the push-forward measure $P_U a_\theta^{-1}$,
with $a_\theta(\cdot) = a(\cdot, \theta)$. For example, if $X \sim N(\theta, 1)$, then each of the following defines an
association: $X = \theta + U$ with $U \sim N(0, 1)$, $X = \theta + \Phi^{-1}(U)$ with $U \sim \mathrm{Unif}(0, 1)$, and

$$ X = \begin{cases} \theta + U & \text{if } \theta \ge 0, \\ \theta - U & \text{if } \theta < 0, \end{cases} \quad \text{with } U \sim N(0, 1). $$

Presently, there appears to be no strong reason to choose one of these associations over
the other. However, the optimality theory presented in Section 4 helps to resolve this
non-uniqueness issue, that is, the optimal IM depends only on the sampling model, and
not on the chosen association. From a practical point of view, we prefer, for continuous
data problems, associations which are continuous in both $\theta$ and $U$, which rules out the
last of the three associations above. Also, we tend to prefer the representation with a
uniform $U$, any other choice being viewed as just a reparametrization of this one. It will
become evident that this view is without loss of generality for simple problems with a
one-dimensional auxiliary variable. The case when $U$ is moderate- to high-dimensional is
more challenging and we defer its discussion to Section 6.

2.2 Three-step IM construction

2.2.1 Association step 

The association (2.1) plays two distinct roles. Before the experiment, the association
characterizes the predictive probabilities of the observable $X$. But once $X = x$ is
observed, the role of the association changes. The key idea is that the observed $x$ and the
unknown $\theta$ must satisfy

$$ x = a(u^\star, \theta) \qquad (2.4) $$

for some unobserved realization $u^\star$ of $U$. Although $u^\star$ is unobserved, there is information
available about the nature of this quantity; in particular, we know exactly the distribution
$P_U$ from which it came.

Of course, the value of $u^\star$ can never be known, but if it were, the inference problem
would be simple: given $X = x$, just solve the equation $x = a(u^\star, \theta)$ for $\theta$. More generally,
one could construct the set of solutions $\Theta_x(u^\star)$, where

$$ \Theta_x(u) = \{\theta : x = a(u, \theta)\}, \quad x \in \mathbb{X}, \; u \in \mathbb{U}. \qquad (2.5) $$

For continuous-data problems, $\Theta_x(u)$ is typically a singleton for each $u$; for other
problems, it could be a set. In either case, given $X = x$, $\Theta_x(u^\star)$ represents the best possible
inference in the sense that the true $\theta$ is guaranteed to be in $\Theta_x(u^\star)$.

Gaussian Example (cont). The Gaussian mean problem is continuous, so the
association $x = \theta + \Phi^{-1}(u)$ identifies a single $\theta$ for each fixed $(x, u)$ pair. Therefore,
$\Theta_x(u) = \{x - \Phi^{-1}(u)\}$. In this case, clearly, if $u^\star$ were somehow observed, then the true
$\theta$ could be determined with complete certainty.

Poisson Example (cont). Integration-by-parts reveals that the $\mathrm{Pois}(\theta)$ distribution
function $F_\theta$ satisfies $F_\theta(x) = 1 - G_{x+1}(\theta)$, where $G_a$ is a $\mathrm{Gamma}(a, 1)$ distribution function.
Therefore, from (2.3), we get the $u$-interval $G_{x+1}(\theta) \le u < G_x(\theta)$. Inverting this
$u$-interval produces the following $\theta$-interval:

$$ \Theta_x(u) = \bigl(G_x^{-1}(u), \; G_{x+1}^{-1}(u)\bigr]. \qquad (2.6) $$

If $u^\star$ were available, then $\Theta_x(u^\star)$ would provide the best possible inference in the sense
that the true value of $\theta$ is guaranteed to sit inside this interval. But even in this ideal
case there is no information available to identify the exact location of $\theta$ in $\Theta_x(u^\star)$.
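The $\theta$-interval $\Theta_x(u) = (G_x^{-1}(u), G_{x+1}^{-1}(u)]$ can be computed with nothing more than the identity $G_a(\theta) = 1 - F_\theta(a-1)$ and a root finder. The following is a minimal sketch, assuming integer shape parameters and plain bisection (a numerical choice made here, not something prescribed by the paper); the function names are hypothetical, and $G_0$ is treated as a point mass at zero so that the lower endpoint is 0 when $x = 0$.

```python
import math

def gamma_cdf(a, theta):
    """G_a(theta) for integer shape a >= 0 and scale 1, via G_a(theta) = 1 - F_theta(a - 1).

    For a = 0 this returns 1 for every theta > 0 (point mass at zero, an assumption).
    """
    if theta <= 0:
        return 0.0
    term = math.exp(-theta)  # Poisson mass at 0
    pois = 0.0
    for k in range(a):
        pois += term
        term *= theta / (k + 1)
    return 1.0 - pois

def gamma_inv(a, u, tol=1e-10):
    """Solve G_a(theta) = u for theta by bisection; returns 0 when a == 0."""
    if a == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    while gamma_cdf(a, hi) < u:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_cdf(a, mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def theta_interval(x, u):
    """Endpoints of Theta_x(u) = (G_x^{-1}(u), G_{x+1}^{-1}(u)] from (2.6)."""
    return gamma_inv(x, u), gamma_inv(x + 1, u)

lower, upper = theta_interval(5, 0.5)
```

Even with $u$ fixed, the output is an interval of positive length, which is the numerical counterpart of the remark that knowing $u^\star$ would still not pin down $\theta$ in the discrete case.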

2.2.2 Prediction step 

The above discussion highlights the importance of the auxiliary variable for inference. It
is, therefore, only natural that the inference problem should focus on accurately predicting
the unobserved $u^\star$. To predict $u^\star$ with a certain desired accuracy, we employ a so-called
predictive random set. First we give the simplest description of a predictive random set
and provide a useful example. More general descriptions will be given later.

Let $u \mapsto S(u)$ be a mapping from $\mathbb{U}$ to a collection of $P_U$-measurable subsets of $\mathbb{U}$; one
decent example of such a mapping $S$ is given in equation (2.7) below. Then the predictive
random set $\mathcal{S}$ is obtained by applying the set-valued mapping $S$ to a draw $U \sim P_U$, i.e.,
$\mathcal{S} = S(U)$ with $U \sim P_U$. The intuition is that if a draw $U \sim P_U$ is a good prediction
for the unobserved $u^\star$, then the random set $\mathcal{S} = S(U)$ should be even better in the sense
that there is high probability that $\mathcal{S} \ni u^\star$.

Gaussian Example (cont). In this example we may predict the unobserved $u^\star$ with a
predictive random set $\mathcal{S}$ defined by the set-valued mapping

$$ S(u) = \{u' \in (0, 1) : |u' - 0.5| < |u - 0.5|\}, \quad u \in (0, 1). \qquad (2.7) $$

As this predictive random set is designed to predict an unobserved uniform variate, we
may also employ (2.7) in other problems, including the Poisson example.

There are, of course, other choices of $S(u)$, e.g., $[0, u)$, $(u, 1]$, $(0.5u, 0.5 + 0.5u)$ and
more. Although some other choice of $\mathcal{S} = S(U)$ might perform slightly better depending
on the assertion of interest, (2.7) seems to be a good default choice, provided that the
association satisfies certain monotonicity conditions. See Sections 3 and 4 for more on
the choice of predictive random sets.

For the remainder of this paper, we shall mostly omit the set-valued mapping $S$
from the notation and speak directly about the predictive random set $\mathcal{S}$. That is, the
predictive random set $\mathcal{S}$ will be just a random subset of $\mathbb{U}$ with distribution $P_\mathcal{S}$. In the
above description, $P_\mathcal{S}$ is just the push-forward measure $P_U S^{-1}$.



2.2.3 Combination step 

For the time being, let us assume that the predictive random set $\mathcal{S}$ is satisfactory for
predicting the unobserved $u^\star$; this is actually easy to arrange, but we defer discussion
until Section 3. To transfer the available information about $u^\star$ to the $\theta$-space, our last
step is to combine the information in the association, the observed $X = x$, and the
predictive random set $\mathcal{S}$. The intuition is that, if $u^\star \in \mathcal{S}$, then the true $\theta$ must be in the
set $\Theta_x(u)$, from (2.5), for at least one $u \in \mathcal{S}$. So, logically, it makes sense to consider, for
inference about $\theta$, the expanded set

$$ \Theta_x(\mathcal{S}) = \bigcup_{u \in \mathcal{S}} \Theta_x(u). \qquad (2.8) $$

The set $\Theta_x(\mathcal{S})$ contains those values of $\theta$ which are consistent with the observed data and
sampling model for at least one candidate $u^\star$ value $u \in \mathcal{S}$. Since $\theta \in \Theta_x(\mathcal{S})$ if and only
if the unobserved $u^\star \in \mathcal{S}$, if we are willing to accept that the predictive random set $\mathcal{S}$ is
satisfactory for predicting $u^\star$, then $\Theta_x(\mathcal{S})$ will do equally well at capturing $\theta$.

Now consider an assertion $A$ about the parameter of interest $\theta$. Mathematically, an
assertion is just a subset of $\Theta$, but it acts much like a hypothesis in the context of classical
statistics. To summarize the evidence in $x$ that supports the assertion $A$, we calculate
the probability that $\Theta_x(\mathcal{S})$ is a subset of $A$, i.e.,

$$ \mathrm{bel}_x(A) = P_\mathcal{S}\{\Theta_x(\mathcal{S}) \subseteq A \mid \Theta_x(\mathcal{S}) \neq \varnothing\}. \qquad (2.9) $$

We refer to $\mathrm{bel}_x(A)$ as the belief function at $A$. Naturally, $\mathrm{bel}_x$ also depends on the choice
of association and predictive random set, but for now we suppress this dependence in the
notation. There are some similarities between our belief function and that of
Dempster-Shafer theory (Shafer 1976). For example, $\mathrm{bel}_x$ is subadditive in the sense that if $A$ is a
non-trivial subset of $\Theta$, then $\mathrm{bel}_x(A) + \mathrm{bel}_x(A^c) \le 1$ with equality if and only if $\Theta_x(\mathcal{S})$ is a
singleton with $P_\mathcal{S}$-probability 1. However, our use of the predictive random set (and our
emphasis on validity in Section 3) separates our approach from that of Dempster-Shafer;
see Martin et al. (2010).

Here we make two technical remarks about the belief function in (2.9). First, in
the problems considered in this paper, the case $\Theta_x(\mathcal{S}) = \varnothing$ is a $P_\mathcal{S}$-null event, so the
belief function can be simplified as $\mathrm{bel}_x(A) = P_\mathcal{S}\{\Theta_x(\mathcal{S}) \subseteq A\}$, no conditioning. This
simplification may not hold in problems where the observation $X = x$ can induce a
constraint on the auxiliary variable $u$. For example, consider the Gaussian example from
above, but suppose that the mean is known to satisfy $\theta \ge 0$. In this case, it is easy to check
that $\Theta_x(\mathcal{S}) = \varnothing$ iff $\Phi^{-1}(\inf \mathcal{S}) > x$, an event which generally has positive $P_\mathcal{S}$-probability.
So, in general, we can ignore conditioning provided that

$$ \Theta_x(u) \neq \varnothing \quad \text{for all } x \text{ and all } u. \qquad (2.10) $$

The IM framework can be modified in cases where (2.10) fails, but we will not discuss
this here; see Ermini Leaf and Liu (2012). Second, measurability of $\mathrm{bel}_x(A)$, as a function
of $x$ for given $A$, which is important in what follows, is not immediately clear from the
definition and should be assessed case-by-case. However, in our experience and in all
examples herein, $\mathrm{bel}_x(A)$ is a nice measurable function of $x$.

Unlike with an ordinary additive probability measure, to reach conclusions about $A$
based on $\mathrm{bel}_x$ one must know both $\mathrm{bel}_x(A)$ and $\mathrm{bel}_x(A^c)$; for example, in the extreme case
of "total ignorance" about $A$, one has $\mathrm{bel}_x(A) = \mathrm{bel}_x(A^c) = 0$. It is often more convenient
to work with a different but related function

$$ \mathrm{pl}_x(A) = 1 - \mathrm{bel}_x(A^c) = P_\mathcal{S}\{\Theta_x(\mathcal{S}) \not\subseteq A^c \mid \Theta_x(\mathcal{S}) \neq \varnothing\}, \qquad (2.11) $$

called the plausibility function at $A$; when $A = \{\theta\}$ is a singleton, we write $\mathrm{pl}_x(\theta)$ instead
of $\mathrm{pl}_x(\{\theta\})$. From the subadditivity of the belief function, it follows that $\mathrm{bel}_x(A) \le \mathrm{pl}_x(A)$
for all $A$. In what follows, to summarize the evidence in $x$ supporting $A$, we shall report
the pair $\mathrm{bel}_x(A)$ and $\mathrm{pl}_x(A)$, also known as lower and upper probabilities.
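The definitions (2.9) and (2.11) translate directly into a Monte Carlo computation. The sketch below is an illustration only, assuming the Gaussian association $\Theta_x(u) = \{x - \Phi^{-1}(u)\}$, the default predictive random set (2.7), and a one-sided assertion $A = (-\infty, a]$ picked for this example; the function names are hypothetical. In this setting $\Theta_x(\mathcal{S})$ is an open interval, so "subset of $A$" reduces to a comparison of endpoints.

```python
import random
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf

def quantile(p):
    """Phi^{-1}(p), extended to p = 0 and p = 1 so that endpoints may be infinite."""
    if p <= 0.0:
        return float("-inf")
    if p >= 1.0:
        return float("inf")
    return PHI_INV(p)

def draw_theta_set(x, rng):
    """One draw of Theta_x(S): the open interval swept out by x - Phi^{-1}(u), u in S(U)."""
    v = abs(rng.random() - 0.5)  # half-width of S(U) in (2.7)
    return x - quantile(0.5 + v), x - quantile(0.5 - v)

def bel_pl_half_line(x, a, n, rng):
    """Monte Carlo bel_x(A) and pl_x(A) for the assertion A = (-inf, a]."""
    inside = 0    # draws with Theta_x(S) a subset of A
    inside_c = 0  # draws with Theta_x(S) a subset of A^c = (a, inf)
    for _ in range(n):
        lower, upper = draw_theta_set(x, rng)
        inside += upper <= a
        inside_c += lower >= a
    return inside / n, 1.0 - inside_c / n

bel, pl = bel_pl_half_line(x=5.0, a=6.0, n=20_000, rng=random.Random(1))
```

For this particular assertion a short calculation gives $\mathrm{bel}_x(A) = \max\{1 - 2\Phi(x - a), 0\}$, which provides a check on the simulation.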

Gaussian Example (cont). With the predictive random set $\mathcal{S}$ in (2.7), the random set
$\Theta_x(\mathcal{S})$ is given by

$$ \Theta_x(\mathcal{S}) = \bigl(x - \Phi^{-1}(0.5 + |U - 0.5|), \; x - \Phi^{-1}(0.5 - |U - 0.5|)\bigr) = \bigl(\underline{\Theta}_x(U), \; \overline{\Theta}_x(U)\bigr), \text{ say}, $$

where $U \sim \mathrm{Unif}(0, 1)$. For a singleton assertion $A = \{\theta\}$, it is easy to see that the belief
function is zero. But the plausibility function is

$$ \mathrm{pl}_x(\theta) = 1 - P_U\{\underline{\Theta}_x(U) > \theta\} - P_U\{\overline{\Theta}_x(U) < \theta\} = 1 - |2\Phi(x - \theta) - 1|. \qquad (2.12) $$

A plot of $\mathrm{pl}_x(\theta)$, with $x = 5$, as a function of $\theta$, is shown in Figure 1(a). The symmetry
around the observed $x$ is apparent, and all those $\theta$ values in a neighborhood of $x = 5$ are
relatively plausible. See Section 3.4 for more statistical applications of this graph.
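The closed form (2.12) is easy to evaluate directly. The sketch below, with hypothetical function names, computes $\mathrm{pl}_x(\theta)$ and, as an added illustration not taken from this section, the set of $\theta$ values whose plausibility exceeds a threshold $\alpha$; solving $1 - |2\Phi(x - \theta) - 1| > \alpha$ for $\theta$ gives an interval centred at $x$ with half-width $\Phi^{-1}(1 - \alpha/2)$.

```python
from statistics import NormalDist

PHI = NormalDist().cdf
PHI_INV = NormalDist().inv_cdf

def pl_gaussian(x, theta):
    """Plausibility of the singleton {theta} given X = x, from (2.12)."""
    return 1.0 - abs(2.0 * PHI(x - theta) - 1.0)

def plausibility_interval(x, alpha):
    """Endpoints of {theta : pl_x(theta) > alpha}, an interval centred at x."""
    half_width = PHI_INV(1.0 - alpha / 2.0)
    return x - half_width, x + half_width

peak = pl_gaussian(5.0, 5.0)              # plausibility at theta = x
lo, hi = plausibility_interval(5.0, 0.05)
```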

Figure 1: Plot of the plausibility functions $\mathrm{pl}_x(\theta)$, as functions of $\theta$, in (a) the Gaussian
example and (b) the Poisson example. In both cases, $X = 5$ is observed.

Poisson Example (cont). With the same predictive random set as in the previous
example, the random set $\Theta_x(\mathcal{S})$ is given by

$$ \Theta_x(\mathcal{S}) = \bigl(G_x^{-1}(0.5 - |U - 0.5|), \; G_{x+1}^{-1}(0.5 + |U - 0.5|)\bigr) = \bigl(\underline{\Theta}_x(U), \; \overline{\Theta}_x(U)\bigr), \text{ say}, $$

where $U$ is a random draw from $\mathrm{Unif}(0, 1)$. For a singleton assertion $A = \{\theta\}$, again the
belief function is zero, but the plausibility function is

$$ \mathrm{pl}_x(\theta) = 1 - P_U\{\underline{\Theta}_x(U) > \theta\} - P_U\{\overline{\Theta}_x(U) < \theta\} = 1 - \max\{1 - 2G_x(\theta), 0\} - \max\{2G_{x+1}(\theta) - 1, 0\}. \qquad (2.13) $$

A graph of $\mathrm{pl}_x(\theta)$, with $x = 5$, as a function of $\theta$ is shown in Figure 1(b). The plateau
indicates that no $\theta$ values in a neighborhood of 5 can be ruled out. As in the Gaussian
example, $\theta$ values in an interval around 5 are all relatively plausible.
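Formula (2.13) needs only the $\mathrm{Gamma}(a, 1)$ distribution function at integer shapes, which can be obtained from the Poisson identity $G_a(\theta) = 1 - F_\theta(a - 1)$. The following is a minimal sketch with hypothetical function names; it assumes $x \ge 0$ is an integer and treats $G_0$ as identically 1 on $\theta > 0$.

```python
import math

def gamma_cdf(a, theta):
    """G_a(theta) for integer shape a >= 0 and scale 1, via G_a(theta) = 1 - F_theta(a - 1)."""
    if theta <= 0:
        return 0.0
    term = math.exp(-theta)  # Poisson mass at 0
    pois = 0.0
    for k in range(a):
        pois += term
        term *= theta / (k + 1)
    return 1.0 - pois

def pl_poisson(x, theta):
    """Plausibility of the singleton {theta} given X = x, following (2.13)."""
    lower_tail = max(1.0 - 2.0 * gamma_cdf(x, theta), 0.0)
    upper_tail = max(2.0 * gamma_cdf(x + 1, theta) - 1.0, 0.0)
    return 1.0 - lower_tail - upper_tail

pl_at_five = pl_poisson(5, 5.0)
```

Both maxima vanish whenever $G_{x+1}(\theta) \le 0.5 \le G_x(\theta)$, that is, for $\theta$ between the medians of $\mathrm{Gamma}(x, 1)$ and $\mathrm{Gamma}(x+1, 1)$, which is the plateau visible in Figure 1(b).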



Dempster (2008) gives a different analysis of this same Poisson problem. His
plausibility function for the singleton assertion $A = \{\theta\}$ is $r_x(\theta) = e^{-\theta}\theta^x / x!$, which is the
Poisson mass function treated as a function of $\theta$. This function has a similar shape to
that in Figure 1(b), but the scale is much smaller. For example, $r_5(5) = 0.175$, suggesting
that the assertion $\{\theta = 5\}$ is relatively implausible, even though $X = 5$ was observed.
Compare this to $\mathrm{pl}_5(5) = 1$. We would argue that, if $X = 5$ is observed, then no
plausibility function threshold should be able to rule out $\{\theta = 5\}$; in that case, $\mathrm{pl}_5(5) = 1$
makes more sense. Furthermore, as Dempster's analysis is similar to ours but with an
invalid predictive random set, namely, $\mathcal{S} = \{U\}$, with $U \sim \mathrm{Unif}(0, 1)$, the corresponding
plausibility function is not properly calibrated for all assertions.
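The value $r_5(5) = 0.175$ quoted above is a one-line computation, sketched here with a hypothetical function name.

```python
import math

def dempster_plausibility(x, theta):
    """r_x(theta) = exp(-theta) * theta**x / x!, the Poisson mass function viewed in theta."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

r55 = dempster_plausibility(5, 5.0)  # roughly 0.175, as quoted in the text
```

Since $r_x(\theta)$ is maximized at $\theta = x$, this value is the largest that $r_5$ attains, which is what "the scale is much smaller" refers to.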



2.3 Interpretation of the belief function 



It is clear that the belief function depends on the observed $x$ and so must be meaningful
within the problem at hand. But while it is data-dependent, $\mathrm{bel}_x(A)$ is not a posterior
probability for $A$ in the familiar Bayesian sense. In fact, under our assumption that $\theta$ is
fixed and non-random, there can be no non-trivial posterior distribution on $\Theta$. The way
around this limitation is to drop the requirement that posterior inference be based on a
bona fide probability measure (e.g., Heath and Sudderth 1978; Walley 1996; Wasserman
1990). Therefore, we recommend interpreting $\mathrm{bel}_x(A)$ and $\mathrm{bel}_x(A^c)$ as degrees of belief,
rather than ordinary probabilities, even though they manifest from $P_U$-probability
calculations. More precisely, $\mathrm{bel}_x(A)$ and $\mathrm{bel}_x(A^c)$ represent the knowledge gained about the
respective claims $\theta \in A$ and $\theta \notin A$ based on both the observed $x$ and prediction of the
auxiliary variable.



2.4 Summary 

The familiar sampling model appears in the A-step, but it is the corresponding
association which is of primary importance. This association, in turn, determines the auxiliary
variable which is to be the focus of the IM framework. We propose to predict the
unobserved value of this auxiliary variable in the P-step with a predictive random set $\mathcal{S}$,
which is chosen to have certain desirable properties (see Definition 1 below). This use of
a predictive random set is likely the aspect of the IM framework which is most difficult to
swallow, but the intuition should be clear: one cannot hope to accurately predict a fixed
value $u^\star$ by an ordinary continuous random variable. With the association, predictive
random set, and observed $X = x$ in hand, one proceeds to the C-step where a random
set $\Theta_x(\mathcal{S})$ on the parameter space is obtained. As this random set corresponds to a set of
"reasonable" $\theta$ values, given $x$, it is natural to summarize the support of an assertion $A$ by
the probability that $\Theta_x(\mathcal{S})$ is a subset of $A$. This probability is exactly the belief function
that characterizes the output of the IM and an argument is presented that justifies the
meaningfulness of $\mathrm{bel}_x(A)$ and $\mathrm{pl}_x(A)$ as summaries of the evidence in favor of $A$.

Finally, we mention that the predictive random set $\mathcal{S}$ can depend on the assertion $A$ of
interest. That is, one might consider using one predictive random set, say $\mathcal{S}_A$, to evaluate
$\mathrm{bel}_x(A)$, and another predictive random set, say $\mathcal{S}_{A^c}$, to evaluate $\mathrm{pl}_x(A) = 1 - \mathrm{bel}_x(A^c)$.
In Section 4 we show that this is actually a desirable strategy, in the sense that the
optimal predictive random set depends on the assertion in question. In what follows, this
dependence of the predictive random set on the assertion will be kept implicit.



3 Theoretical validity of IMs 



3.1 Intuition 

In Section 2 we argued that $\mathrm{bel}_x(A; \mathcal{S})$ and $\mathrm{pl}_x(A; \mathcal{S})$ together provide a meaningful
summary of evidence in favor of $A$ for the given $X = x$; our notation now explicitly indicates
the dependence of these functions on the predictive random set $\mathcal{S}$. In this section we show
that $\mathrm{bel}_X(A; \mathcal{S})$ and $\mathrm{pl}_X(A; \mathcal{S})$ are also meaningful as functions of the random variable
$X \sim P_{X|\theta}$ for a fixed assertion $A$. For example, we show that $\mathrm{bel}_X(A)$ is
frequency-calibrated in the following sense: if $\theta \notin A$, then $P_{X|\theta}\{\mathrm{bel}_X(A; \mathcal{S}) \ge 1 - \alpha\} \le \alpha$ for each
$\alpha \in [0, 1]$. In other words, the amount of evidence in favor of a false $A$ can be large with
only small probability. This property means that the belief function is appropriately
scaled for objective scientific inference. A similar property also holds for $\mathrm{pl}_X(A)$. We
refer to this frequency-calibration property as validity (Definition 2). Bernardo (1979),
Rubin (1984) and Dawid (1985) give similar investigations of frequency-calibration and
of objective priors for Bayesian inference.

3.2 Predictive random sets

We start with a few definitions, similar to those found in Martin et al. (2010) and
Zhang and Liu (2011). Set $Q_\mathcal{S}(u) = P_\mathcal{S}\{\mathcal{S} \not\ni u\}$, for $u \in \mathbb{U}$, which is the probability
that the predictive random set $\mathcal{S}$ misses the specified target $u$. Ideally, $\mathcal{S}$ will be such
that the random variable $Q_\mathcal{S}(U)$, a function of $U \sim P_U$, will be probabilistically small.
This suggests a connection between $P_\mathcal{S}$ and $P_U$ which will be made precise in Theorem 1.

Definition 1. A predictive random set $\mathcal{S}$ is valid for predicting the unobserved
auxiliary variable if $Q_\mathcal{S}(U)$, as a function of $U \sim P_U$, is stochastically no larger than $\mathrm{Unif}(0, 1)$,
i.e., for each $\alpha \in (0, 1)$, $P_U\{Q_\mathcal{S}(U) \ge 1 - \alpha\} \le \alpha$. If "$\le \alpha$" can be replaced by "$= \alpha$,"
then $\mathcal{S}$ is efficient.

In words, validity of $\mathcal{S}$ implies that the probability that it misses a target $u$ is large for
only a small $P_U$-proportion of possible $u$ values. The predictive random set $\mathcal{S}$ defined by
the mapping (2.7) is both valid and efficient. Indeed, it is easy to check that, in this case,
$Q_\mathcal{S}(u) = |2u - 1|$. Therefore, if $U \sim \mathrm{Unif}(0, 1)$ then $Q_\mathcal{S}(U) \sim \mathrm{Unif}(0, 1)$ too. Corollary 1
below gives a simple and general recipe for constructing a valid and efficient $\mathcal{S}$.
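Both claims about the default predictive random set (2.7), namely $Q_\mathcal{S}(u) = |2u - 1|$ and $P_U\{Q_\mathcal{S}(U) \ge 1 - \alpha\} = \alpha$, can be checked by simulation. The sketch below is a sanity check rather than a proof; the function names, sample sizes, and seed are arbitrary choices made here.

```python
import random

def miss_probability(u):
    """Q_S(u) = P{S does not contain u} for the default set (2.7), i.e. |2u - 1|."""
    return abs(2.0 * u - 1.0)

def contains(u_draw, u_target):
    """True when the set S(u_draw) from (2.7) contains u_target."""
    return abs(u_target - 0.5) < abs(u_draw - 0.5)

def empirical_miss(u_target, n, rng):
    """Monte Carlo estimate of Q_S(u_target), using draws U ~ Unif(0, 1)."""
    misses = sum(not contains(rng.random(), u_target) for _ in range(n))
    return misses / n

def exceedance_rate(alpha, n, rng):
    """Monte Carlo estimate of P_U{Q_S(U) >= 1 - alpha}; validity requires <= alpha."""
    hits = sum(miss_probability(rng.random()) >= 1.0 - alpha for _ in range(n))
    return hits / n

rng = random.Random(2)
q_hat = empirical_miss(0.8, 50_000, rng)   # compare with |2 * 0.8 - 1| = 0.6
rate = exceedance_rate(0.1, 50_000, rng)   # compare with alpha = 0.1
```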

There is an important and apparently fundamental concept related to validity of
predictive random sets, namely, nesting. We say that a collection of sets $\mathbb{S} \subseteq 2^{\mathbb{U}}$ is nested
if, for any pair of sets $S$ and $S'$ in $\mathbb{S}$, we have $S \subseteq S'$ or $S' \subseteq S$. We shall also implicitly
assume, without loss of generality, that $P_U(S) > 0$ for some $S \in \mathbb{S}$; the user can easily
arrange this. The following theorem shows that if the predictive random set $\mathcal{S}$ is nested,
i.e., if $\mathcal{S}$ is supported on a nested collection of sets $\mathbb{S}$, then it is valid.

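Theorem 1. Let $\mathbb{S} \subseteq 2^{\mathbb{U}}$ be a nested collection of $P_U$-measurable subsets $S$ of $\mathbb{U}$.
Define a predictive random set $\mathcal{S}$ with distribution $P_\mathcal{S}$, supported on $\mathbb{S}$, such that

$$ P_\mathcal{S}\{\mathcal{S} \subseteq K\} = \sup_{S \in \mathbb{S} : S \subseteq K} \bar{P}_U(S), \quad K \subseteq \mathbb{U}, $$

where $\bar{P}_U(\cdot) = P_U(\cdot) / \sup_{S \in \mathbb{S}} P_U(S)$. Then $\mathcal{S}$ is valid in the sense of Definition 1.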

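Proof. The idea of the proof is that $Q_\mathcal{S}(u) = P_\mathcal{S}\{\mathcal{S} \not\ni u\}$ is large iff $u$ sits outside a
set that contains most realizations of $\mathcal{S}$. To make this formal, take any $\alpha \in (0, 1)$ and
let $S_\alpha = \bigcap\{S \in \mathbb{S} : P_U(S) \ge 1 - \alpha\}$ be the smallest set in $\mathbb{S}$ with $P_U$-probability no less
than $1 - \alpha$; here, the intersection over an empty collection of sets is taken to be $\mathbb{U}$. Since
$\mathbb{S}$ is nested, $S_\alpha \in \mathbb{S}$, $P_U(S_\alpha) \ge 1 - \alpha$, and

$$ P_\mathcal{S}\{\mathcal{S} \subseteq S_\alpha\} = \sup_{S \in \mathbb{S} : S \subseteq S_\alpha} \bar{P}_U(S) = \bar{P}_U(S_\alpha) \ge P_U(S_\alpha) \ge 1 - \alpha. $$

Therefore, since $Q_\mathcal{S}(u) > 1 - \alpha$ iff $u \notin S_\alpha$, we get $P_U\{Q_\mathcal{S}(U) > 1 - \alpha\} = P_U(S_\alpha^c) =
1 - P_U(S_\alpha) \le \alpha$. Finally, validity follows since $\alpha$ was arbitrary. □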

It is clear that, if $P_U$ is absolutely continuous and the nested support $\mathbb{S}$ is sufficiently
rich, then the predictive random set defined above is also efficient. Specifically, this
holds if $\mathbb{U} \in \mathbb{S}$ and, for $S_\alpha$ defined in the proof above, $P_U(S_\alpha) = 1 - \alpha$ for every
$\alpha \in (0, 1)$. This vague argument for efficiency is made more precise in the next important
special case.

Corollary 1. Suppose that $P_U$ is non-atomic, and let $h$ be a real-valued function on
$\mathbb{U}$. Then the predictive random set $\mathcal{S} = \{u \in \mathbb{U} : h(u) < h(U)\}$, with $U \sim P_U$, is valid. If
$h$ is continuous and constant only on $P_U$-null sets, then it is also efficient.

Proof. Validity is a consequence of Theorem 1 and the fact that this $\mathcal{S}$ is nested. To
prove the efficiency claim, let $H$ be the distribution function of $h(U)$ when $U \sim P_U$.
Then, for $u \in \mathbb{U}$, $Q_\mathcal{S}(u) = P_U\{h(U) \le h(u)\} = H(h(u))$. If $h$ satisfies the stated
conditions, then $h(U)$ is a continuous random variable. Therefore, if $U \sim P_U$, then
$Q_\mathcal{S}(U) = H(h(U)) \sim \mathrm{Unif}(0, 1)$, so efficiency follows. □
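To see the recipe of Corollary 1 at work beyond one dimension, the sketch below takes $\mathbb{U}$ to be the unit square with $P_U$ Lebesgue measure and uses $h(u) = \max_i |u_i - 0.5|$. Both choices are illustrative assumptions made here, not taken from the paper, and the function names are hypothetical. For this $h$, the distribution function of $h(U)$ is $H(t) = (2t)^2$ on $[0, 0.5]$, so $Q_\mathcal{S}(u) = (2h(u))^2$, and efficiency predicts $P_U\{Q_\mathcal{S}(U) \ge 1 - \alpha\} = \alpha$.

```python
import random

def h(u):
    """Sup-norm distance from the centre of the unit square (an illustrative choice)."""
    return max(abs(u[0] - 0.5), abs(u[1] - 0.5))

def miss_probability(u):
    """Q_S(u) = H(h(u)), where H(t) = (2t)^2 is the distribution function of h(U)."""
    return (2.0 * h(u)) ** 2

def in_random_set(u, u_draw):
    """Membership in S = {u : h(u) < h(U)} for the realized draw U = u_draw."""
    return h(u) < h(u_draw)

def exceedance_rate(alpha, n, rng):
    """Monte Carlo estimate of P_U{Q_S(U) >= 1 - alpha} with U uniform on the square."""
    count = 0
    for _ in range(n):
        u = (rng.random(), rng.random())
        count += miss_probability(u) >= 1.0 - alpha
    return count / n

rate = exceedance_rate(0.2, 50_000, random.Random(3))  # compare with alpha = 0.2
```

The realized sets are concentric open squares, which makes the nesting required by Theorem 1 easy to visualize.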

The above results demonstrate that nesting is a sufficient condition for predictive
random set validity. But nesting is not a necessary condition (Martin et al. 2010). The
real issue, however, is the performance of the corresponding IM. We show in Section 4
that for any non-nested predictive random set $\mathcal{S}$, there is a nested predictive random set
$\mathcal{S}'$ such that the IM based on $\mathcal{S}'$ is "at least as good" as that based on $\mathcal{S}$.

3.3 IM validity 

Validity of the underlying predictive random set S is essentially all that is needed to 
prove the meaningfulness of the corresponding IM/belief function. Here meaningfulness 
refers to a calibration property of the belief function. 

Definition 2. Suppose X ~ Px\e and let A be an assertion of interest. Then the IM 
with belief function bel x . is valid for A if, for each a £ (0, 1), 

supP x \ e {be\x{A;S) > 1 - a} < a. (3.1) 

The IM is valid if it is valid for all A. 

By (2.11), the validity property can also be stated in terms of the plausibility function.
That is, the IM is valid if, for all assertions $A$ and for any $\alpha \in (0, 1)$,

$\sup_{\theta \in A} P_{X|\theta}\{\mathsf{pl}_X(A; \mathcal{S}) \le \alpha\} \le \alpha.$    (3.2)

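Theorem 2. Suppose the predictive random set $\mathcal{S}$ is valid, and $\Theta_x(\mathcal{S}) \ne \emptyset$ with
$P_{\mathcal{S}}$-probability 1 for all $x$. Then the IM is valid.

Proof. For any $A$, take $(x, u, \theta)$ such that $\theta \notin A$ and $x = a(\theta, u)$. Since $A \subseteq \{\theta\}^c$,
$\mathsf{bel}_x(A; \mathcal{S}) \le \mathsf{bel}_x(\{\theta\}^c; \mathcal{S}) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \not\ni \theta\} = P_{\mathcal{S}}\{\mathcal{S} \not\ni u\}$ by monotonicity. Validity
of $\mathcal{S}$ implies that the right-hand side, as a function of $U \sim P_U$, is stochastically smaller
than $\mathsf{Unif}(0, 1)$. This, in turn, implies the same of $\mathsf{bel}_X(A; \mathcal{S})$ as a function of $X \sim P_{X|\theta}$.
Therefore, $P_{X|\theta}\{\mathsf{bel}_X(A; \mathcal{S}) \ge 1 - \alpha\} \le P\{\mathsf{Unif}(0, 1) \ge 1 - \alpha\} = \alpha$. Taking a supremum
over $\theta \notin A$ on the left-hand side completes the proof. □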

A key feature of the validity theorem above is that it holds under minimal conditions
on the predictive random set. Validity of the IM does not depend on the particular form
of predictive random set, only that it is valid. Recall that the condition "$\Theta_x(\mathcal{S}) \ne \emptyset$ with
$P_{\mathcal{S}}$-probability 1" holds whenever (2.10) holds. See also Ermini Leaf and Liu (2012).

The following corollary states that the validity theorem remains true even after a
suitable (possibly $\theta$-dependent) change of auxiliary variable. In other words, the validity
property is independent of the choice of auxiliary variable parametrization. This
reparametrization comes in handy in examples, including those in Section 

Corollary 2. Consider a one-to-one transformation $v = \varphi_\theta(u)$ such that the push-forward
measure $P_V = P_U \varphi_\theta^{-1}$ on $\mathbb{V} = \varphi_\theta(\mathbb{U})$ does not depend on $\theta$. Suppose $\mathcal{S}$ is valid
for predicting $v^\star = \varphi_\theta(u^\star)$, and $\Theta_x(\mathcal{S}) \ne \emptyset$ with $P_{\mathcal{S}}$-probability 1 for all $x$. Then the
corresponding belief function satisfies (3.1) and the transformed IM is valid.



3.4 Application: IM-based frequentist procedures 

In addition to providing problem-specific measures of certainty about various assertions
of interest, the belief/plausibility functions can easily be used to create frequentist
procedures. First consider testing $H_0 : \theta \in A$ versus $H_1 : \theta \in A^c$. Then an IM-based
counterpart to a frequentist testing rule is of the following form:

Reject $H_0$ if $\mathsf{pl}_x(A) \le \alpha$, for a specified $\alpha \in (0, 1)$.    (3.3)

According to (3.2) and Theorem 2, if the predictive random set $\mathcal{S}$ is valid, then the
probability of a Type I error for such a rejection rule is $\sup_{\theta \in A} P_{X|\theta}\{\mathsf{pl}_X(A) \le \alpha\} \le \alpha$.
Therefore, the test (3.3) controls the probability of a Type I error at level $\alpha$.

Next consider the class of singleton assertions $\{\theta\}$, with $\theta \in \Theta$. As a counterpart to
a frequentist confidence region, define the $100(1 - \alpha)\%$ plausibility region

$\Pi_x(\alpha) = \{\theta : \mathsf{pl}_x(\theta) > \alpha\}.$    (3.4)

Now the coverage probability of the plausibility region (3.4) is

$P_{X|\theta}\{\Pi_X(\alpha) \ni \theta\} = P_{X|\theta}\{\mathsf{pl}_X(\theta) > \alpha\} = 1 - P_{X|\theta}\{\mathsf{pl}_X(\theta) \le \alpha\} \ge 1 - \alpha,$

where the last inequality follows from Theorem 2. Therefore, this plausibility region has
at least the nominal coverage probability.

Gaussian Example (cont). Suppose $X = 5$. Then, using the predictive random set $\mathcal{S}$
in (2.7), the plausibility function is $\mathsf{pl}_5(\theta) = 1 - |2\Phi(5 - \theta) - 1|$. The 90% plausibility
interval for $\theta$, determined by the inequality $\mathsf{pl}_5(\theta) > 0.10$, is $5 \pm \Phi^{-1}(0.05)$, the same as
the classical 90% $z$-interval for $\theta$ given in standard textbooks.
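
The Gaussian plausibility function and its 90% interval can be reproduced with a few lines of code. The sketch below evaluates $\mathsf{pl}_x(\theta) = 1 - |2\Phi(x - \theta) - 1|$, inverts it in closed form, and estimates the coverage probability by simulation. The function names, seed, and simulation size are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def pl(x, theta):
    """Plausibility of the singleton {theta} given X = x in the N(theta, 1) model."""
    return 1.0 - abs(2.0 * PHI.cdf(x - theta) - 1.0)


def plausibility_interval(x, alpha):
    """Endpoints of {theta : pl_x(theta) > alpha}, i.e. x -/+ Phi^{-1}(1 - alpha/2)."""
    half_width = PHI.inv_cdf(1.0 - alpha / 2.0)
    return x - half_width, x + half_width


def coverage(theta, alpha, n_draws=20000, seed=2):
    """Monte Carlo frequency with which the plausibility interval contains theta."""
    rng = random.Random(seed)
    hits = sum(pl(rng.gauss(theta, 1.0), theta) > alpha for _ in range(n_draws))
    return hits / n_draws


if __name__ == "__main__":
    print(plausibility_interval(5.0, 0.10))  # roughly (3.355, 6.645)
    print(coverage(theta=0.0, alpha=0.10))   # should be close to 0.90
```

Because the model is continuous and the random set is efficient, the simulated coverage should sit near the nominal level rather than above it.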

Poisson Example (cont). For the predictive random set determined by $\mathcal{S}$ in (2.7), the
plausibility function $\mathsf{pl}_x(\theta)$ is displayed in (2.13). For observed $X = 5$, a 90% plausibility
interval for $\theta$, characterized by the inequality $\mathsf{pl}_5(\theta) > 0.10$, is (1.97, 10.51). This interval
is not the best possible; in fact, the one presented in Section 4.3.2 is better. But these
plausibility intervals have exact coverage properties, which means that they may be too
conservative at certain $\theta$ values for practical use. This is the case for all exact intervals
in discrete data problems (e.g., Brown et al., 2003; Cai, 2005).

4 Theoretical optimality of IMs

4.1 Intuition

Martin et al. (2010) showed that Fisher's fiducial inference and Dempster-Shafer theory
are special cases of the IM framework corresponding to a singleton predictive random set.
But it is easy to show that, for some assertions $A$, the fiducial probability is not valid
in the sense of Definition 2. To correct for this bias, we propose to replace the singleton
with some larger $\mathcal{S}$. But taking $\mathcal{S}$ to be too large will lead to inefficient inference. So the
goal is to take $\mathcal{S}$ just large enough that validity is achieved.



4.2 Preliminaries 

Throughout the subsequent discussion, we shall assume (2.10), i.e., $\Theta_x(u) \ne \emptyset$ for all $x$
and $u$. This allows us to ignore conditioning in the definition of belief functions.

For the predictive random set $\mathcal{S}_0 = \{U\}$, with $U \sim \mathsf{Unif}(0, 1)$, the belief function at
$A$ is $\mathsf{bel}_x(A; \mathcal{S}_0) = P_U\{\Theta_x(U) \subseteq A\}$, where $\Theta_x(u) = \{\theta : x = a(\theta, u)\}$ as in (2.5). This
is exactly the fiducial probability for $A$ given $X = x$. For a general predictive random
set $\mathcal{S}$, we have $\mathsf{bel}_x(A; \mathcal{S}) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \subseteq A\}$, where $\Theta_x(\mathcal{S}) = \bigcup_{u \in \mathcal{S}} \Theta_x(u)$ is defined in
(2.8). In light of the discussion in Section 4.1, we shall compare the two belief functions
$\mathsf{bel}_x(A; \mathcal{S})$ and $\mathsf{bel}_x(A; \mathcal{S}_0)$. Towards this, we have the following result which says that
the fiducial probability is an upper bound for the belief function.

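Proposition 1. If (2.10) holds and the predictive random set $\mathcal{S}$ is valid in the sense
of Definition 1, then $\mathsf{bel}_x(A; \mathcal{S}) \le \mathsf{bel}_x(A; \mathcal{S}_0)$ for each fixed $x$.

Proof. Let $\mathbb{U}_x(A) = \{u : \Theta_x(u) \subseteq A\}$; note that $\mathcal{S} \subseteq \mathbb{U}_x(A)$ iff $\Theta_x(\mathcal{S}) \subseteq A$. Also, put
$b = \mathsf{bel}_x(A; \mathcal{S})$ and $b_0 = \mathsf{bel}_x(A; \mathcal{S}_0) = P_U\{\mathbb{U}_x(A)\}$. If $u \notin \mathbb{U}_x(A)$, then

$Q_{\mathcal{S}}(u) = P_{\mathcal{S}}\{\mathcal{S} \not\ni u\} \ge P_{\mathcal{S}}\{\mathcal{S} \subseteq \mathbb{U}_x(A)\} = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \subseteq A\} = b.$

Therefore, $P_U\{Q_{\mathcal{S}}(U) \ge b\} \ge P_U\{\mathbb{U}_x(A)^c\} = 1 - b_0$. Also, validity of $\mathcal{S}$ implies
$P_U\{Q_{\mathcal{S}}(U) \ge b\} \le 1 - b$. Consequently, $1 - b_0 \le 1 - b$, i.e., $\mathsf{bel}_x(A; \mathcal{S}) \le \mathsf{bel}_x(A; \mathcal{S}_0)$. □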

For given assertion $A$ and predictive random set $\mathcal{S}$, consider the ratio

$R_A(x; \mathcal{S}) = \mathsf{bel}_x(A; \mathcal{S}) / \mathsf{bel}_x(A; \mathcal{S}_0), \quad x \in \mathbb{X}.$    (4.1)

We call this the relative efficiency of the IM based on $\mathcal{S}$ compared to fiducial.
Proposition 1 guarantees that this ratio is bounded by unity, provided that the denominator
$\mathsf{bel}_x(A; \mathcal{S}_0)$ is non-zero. Our main goal is to choose $\mathcal{S}$ to make this ratio large in some
sense. Towards this goal, we have the following "complete-class theorem" which says that
nested predictive random sets, which by Theorems 1 and 2 produce valid IMs, are the
only kind of predictive random sets that need to be considered.
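
To make the ratio (4.1) concrete, here is a sketch for the Gaussian model $X = \theta + \Phi^{-1}(U)$ and the left-sided assertion $A = (-\infty, \theta_0)$. It assumes a symmetric predictive random set of the form $\{u : |u - 0.5| < |U - 0.5|\}$, for which a short calculation gives $\mathsf{bel}_x(A; \mathcal{S}) = \max\{0, 2\Phi(\theta_0 - x) - 1\}$, while the fiducial value is $\mathsf{bel}_x(A; \mathcal{S}_0) = \Phi(\theta_0 - x)$. Both closed forms and all function names are working assumptions for illustration, not formulas quoted from the text.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def bel_fiducial(x, theta0):
    """bel_x(A; S_0) for A = (-inf, theta0): singleton (fiducial) random set."""
    return PHI.cdf(theta0 - x)


def bel_symmetric(x, theta0):
    """bel_x(A; S) for the assumed random set {u : |u - 0.5| < |U - 0.5|}."""
    return max(0.0, 2.0 * PHI.cdf(theta0 - x) - 1.0)


def relative_efficiency(x, theta0):
    """R_A(x; S) = bel_x(A; S) / bel_x(A; S_0), as in (4.1)."""
    return bel_symmetric(x, theta0) / bel_fiducial(x, theta0)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0):
        print(x, round(relative_efficiency(x, theta0=0.0), 4))
```

The ratio never exceeds one, in line with Proposition 1, and it approaches one only when the data strongly support $A$.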

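Theorem 3. Fix $A \subseteq \Theta$ and assume (2.10). Given any predictive random set $\mathcal{S}$,
there exists a nested predictive random set $\mathcal{S}'$ such that $R_A(x; \mathcal{S}') \ge R_A(x; \mathcal{S})$ for each $x$.

Proof. Given $\mathcal{S}$, construct a collection $\mathbb{S} = \{S_x : x \in \mathbb{X}\}$ as follows:

$S_x = \bigcap_{x' :\, \mathsf{bel}_{x'}(A; \mathcal{S}) \ge \mathsf{bel}_x(A; \mathcal{S})} \mathbb{U}_{x'}(A),$

where $\mathbb{U}_x(A)$ is defined in the proof of Proposition 1. This collection $\mathbb{S}$, which will serve
as the support for the new $\mathcal{S}'$, is clearly nested. Indeed, if $\mathsf{bel}_{x_1}(A; \mathcal{S}) \le \mathsf{bel}_{x_2}(A; \mathcal{S})$, then
$S_{x_2} \supseteq S_{x_1}$. The distribution $P_{\mathcal{S}'}$ of $\mathcal{S}'$ is defined as

$P_{\mathcal{S}'}\{\mathcal{S}' \subseteq K\} = \sup_{x :\, S_x \subseteq K} b(x), \quad K \subseteq \mathbb{U},$

where $b(t) = \mathsf{bel}_t(A; \mathcal{S}) / \sup_x \mathsf{bel}_x(A; \mathcal{S})$ is the normalized belief function. In particular,
for $K = S_x$, we have $P_{\mathcal{S}'}\{\mathcal{S}' \subseteq S_x\} = b(x) \ge \mathsf{bel}_x(A; \mathcal{S})$. Then we have

$\mathsf{bel}_x(A; \mathcal{S}') = P_{\mathcal{S}'}\{\Theta_x(\mathcal{S}') \subseteq A\}$
$\qquad = P_{\mathcal{S}'}\{\mathcal{S}' \subseteq \mathbb{U}_x(A)\}$
$\qquad \ge P_{\mathcal{S}'}\{\mathcal{S}' \subseteq S_x\}$
$\qquad \ge \mathsf{bel}_x(A; \mathcal{S}),$

where the second equality is due to the fact that $\Theta_x(\mathcal{S}') \subseteq A$ iff $\mathcal{S}' \subseteq \mathbb{U}_x(A)$, and the
first inequality is by monotonicity of $P_{\mathcal{S}'}\{\mathcal{S}' \subseteq \cdot\}$ and the fact that $S_x \subseteq \mathbb{U}_x(A)$ for each
$x$. Therefore, $R_A(x; \mathcal{S}')$ can be no less than $R_A(x; \mathcal{S})$ for each $x$, proving the claim. □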

We omit the details, but if the collection $\{\mathbb{U}_x(A) : x \in \mathbb{X}\}$ from the proof of Proposition 1 is itself
nested, then, in general, one can construct an "optimal" $\mathcal{S}$ such that $R_A(x; \mathcal{S}) = 1$ for each $x$. This
is done explicitly for the special case in Section 4.3.1 below.



4.3 Optimality in special cases 

Throughout this section, we will focus on scalar $X$ and $\theta$. However, this is just for
simplicity, and not a limitation of the method; see Section 5. Indeed, special
dimension-reduction techniques, akin to Fisher's theory of sufficient statistics, are available to reduce
the dimension of observed $X$ to that of $\theta$ within the IM framework; see Martin and Liu
(2012, 2013). Also, there is no conceptual difference between scalar and vector $\theta$ problems
so, since the ideas are new, we prefer to keep the presentation as simple as possible.



4.3.1 One-sided assertions 

Here we consider a one-sided assertion, e.g., $A = \{\theta \in \Theta : \theta < \theta_0\}$, where $\theta_0$ is fixed. This
"left-sided" assertion is the kind we shall focus on, but other one-sided assertions can be
handled similarly. In this context, we can consider a very strong definition of optimality.

Definition 3. Fix a left-sided assertion $A$. For two nested predictive random sets
$\mathcal{S}$ and $\mathcal{S}'$, the IM based on $\mathcal{S}$ is said to be more efficient than that based on $\mathcal{S}'$ if, as
functions of $X \sim P_{X|\theta}$ for any $\theta \in A$, $R_A(X; \mathcal{S})$ is stochastically larger than $R_A(X; \mathcal{S}')$.
The IM based on $\mathcal{S}^\star$ is optimal, or most efficient, if $R_A(X; \mathcal{S}^\star)$ is stochastically largest,
in the sense above, among all nested predictive random sets.

That optimality here is described via a stochastic ordering property is natural in light
of the notion of validity used throughout. This definition is particularly strong because it
concerns the full distribution of $R_A(X; \mathcal{S})$ as a function of $X \sim P_{X|\theta}$, not just a functional
thereof. Next we establish a strong optimality result for one-sided assertions; when the
assertion is not one-sided, it may not be possible to establish such a strong result.

Theorem 4. Let $A = \{\theta \in \Theta : \theta < \theta_0\}$ be a left-sided assertion. Suppose that $\Theta_x(u)$,
defined in (2.5), is such that, for each $x$, the right endpoint $\sup \Theta_x(u)$ is a non-decreasing
(resp. non-increasing) function of $u$. Then, for the given $A$, the optimal predictive random
set is $\mathcal{S}^\star = [0, U]$ (resp. $\mathcal{S}^\star = [U, 1]$), where $U \sim \mathsf{Unif}(0, 1)$.

Proof. First observe that both forms of $\mathcal{S}^\star$ are nested. We shall focus on the
non-decreasing case only; the other case is similar. Since $\sup \Theta_x(u)$ is non-decreasing in $u$, it
follows that $\sup \Theta_x([0, U]) = \sup \Theta_x(U)$. Therefore,

$\mathsf{bel}_x(A; \mathcal{S}^\star) = P_U\{\sup \Theta_x([0, U]) < \theta_0\} = P_U\{\sup \Theta_x(U) < \theta_0\} = \mathsf{bel}_x(A; \mathcal{S}_0).$

This holds for all $x$, so $R_A(\cdot\,; \mathcal{S}^\star) \equiv 1$, its upper bound. Consequently, $R_A(X; \mathcal{S}^\star)$ is
stochastically larger than $R_A(X; \mathcal{S})$ for any other $\mathcal{S}$, so optimality of $\mathcal{S}^\star$ obtains. □

Gaussian Example (cont). We showed previously that $\Theta_x(u) = \{x - \Phi^{-1}(u)\}$. If we
treat this as a degenerate interval, then we see that the right endpoint $x - \Phi^{-1}(u)$ is a
strictly decreasing function of $u$. Therefore, by Theorem 4, the optimal predictive random
set for a left-sided assertion is $\mathcal{S}^\star = [U, 1]$, $U \sim \mathsf{Unif}(0, 1)$.

As an application, consider the testing problem $H_0 : \theta \ge \theta_0$ versus $H_1 : \theta < \theta_0$. If
we take $A = (-\infty, \theta_0)$, then the IM-based rule (3.3) rejects $H_0$ iff $1 - \mathsf{bel}_x(A; \mathcal{S}^\star) \le \alpha$.
With the optimal $\mathcal{S}^\star = [U, 1]$ as above, we get $\mathsf{bel}_x(A; \mathcal{S}^\star) = \Phi(\theta_0 - x)$. So the IM-based
testing rule rejects $H_0$ iff $\Phi(\theta_0 - x) \ge 1 - \alpha$ or, equivalently, iff $x \le \theta_0 - \Phi^{-1}(1 - \alpha)$.
The reader will recognize this as the uniformly most powerful size-$\alpha$ test based on the
classical Neyman-Pearson theory.
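
The equivalence between the IM-based rule and the one-sided $z$-test can be checked directly. The sketch below codes both decision rules for the Gaussian model with unit variance and compares them on a grid of observations; the function names and the grid are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def im_rejects(x, theta0, alpha):
    """IM rule (3.3): reject H0: theta >= theta0 iff 1 - bel_x(A; S*) <= alpha,
    where bel_x(A; S*) = Phi(theta0 - x) for A = (-inf, theta0)."""
    bel = PHI.cdf(theta0 - x)
    return 1.0 - bel <= alpha


def z_test_rejects(x, theta0, alpha):
    """Classical one-sided z-test: reject iff x <= theta0 - Phi^{-1}(1 - alpha)."""
    return x <= theta0 - PHI.inv_cdf(1.0 - alpha)


if __name__ == "__main__":
    for x in (-2.5, -1.7, -1.6, 0.0):
        print(x, im_rejects(x, 0.0, 0.05), z_test_rejects(x, 0.0, 0.05))
```

The two rules can disagree only through floating-point rounding when $x$ falls exactly on the critical value, which is why the comparison uses grid points away from it.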

Poisson Example (cont). In this case, $\Theta_x(u) = (G_x^{-1}(u), G_{x+1}^{-1}(u)]$; see (2.6). The
right endpoint $G_{x+1}^{-1}(u)$ is strictly increasing in $u$. So Theorem 4 states that, for
left-sided assertions, the optimal predictive random set is $\mathcal{S}^\star = [0, U]$, $U \sim \mathsf{Unif}(0, 1)$. The
same connection with the Neyman-Pearson uniformly most powerful test in the Gaussian
example holds here as well, but we omit the details.

4.3.2 Two-sided assertions 

Consider the case where $A = \{\theta_0\}^c$ is the two-sided assertion of interest, with $\theta_0$ a fixed
interior point of $\Theta \subseteq \mathbb{R}$. This is an important case, which we have already considered
in Section 2, just in a different form. These problems are apparently more difficult than
their one-sided counterparts, just like in the classical hypothesis testing context. Here we
present some basic results and intuitions on local IM optimality for two-sided assertions.

Assume $P_{X|\theta}$ is continuous. Then the fiducial probability $\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}_0)$ for the
two-sided assertion is unity, and so the relative efficiency (4.1) is simply $\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S})$. Here
we focus on predictive random sets $\mathcal{S}$ with the property that $\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}) \sim \mathsf{Unif}(0, 1)$
under $P_{X|\theta_0}$; see Corollary 1. Based on the intuition developed in Section 4.1, $\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S})$
should be smallest (probabilistically) under $P_{X|\theta}$ for $\theta = \theta_0$. We shall, therefore, impose
the following condition on the predictive random set $\mathcal{S}$:

$P_{X|\theta}\{\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}) \le \alpha\} \le \alpha, \quad \forall\, \theta, \ \forall\, \alpha \in (0, 1).$    (4.2)

Roughly speaking, condition (4.2) states that the belief function at $\{\theta_0\}^c$ is stochastically
larger under $P_{X|\theta}$ than under $P_{X|\theta_0}$. There is also a loose connection between (4.2) and the
classical unbiasedness condition imposed to construct optimal tests when the alternative
hypothesis is two-sided (Lehmann and Romano, 2005, Ch. 4). Our goal in what follows
is to find a "best" predictive random set that satisfies (4.2).

To make things formal, suppose that both $\mathbb{X}$ and $\Theta$ are one-dimensional, that $P_{X|\theta}$
is continuous with distribution function $F_\theta(x)$ and density function $f_\theta(x)$, and that the
usual regularity conditions hold; in particular, we assume that the order of expectation
with respect to $P_{X|\theta}$ and differentiation with respect to $\theta$ can be interchanged. Note that
we have fixed a parametrization, and the analysis that follows depends on this selection.
Let $T_\theta(x) = (\partial/\partial\theta) \log f_\theta(x)$ be the score function, an important quantity in what follows.
Also, let $V_\theta(x) = T_\theta(x)^2 + (\partial/\partial\theta) T_\theta(x)$. Then, under the usual regularity conditions, we
have $E_{X|\theta}\{T_\theta(X)\} = 0$ and $E_{X|\theta}\{V_\theta(X)\} = 0$ for all $\theta$.
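
The two moment identities can be checked numerically in a simple case. The sketch below uses the exponential model with mean $\theta$, for which $T_\theta(x) = (x - \theta)/\theta^2$ and $(\partial/\partial\theta) T_\theta(x) = 1/\theta^2 - 2x/\theta^3$. A midpoint rule on a truncated range approximates both expectations; the quadrature settings and function names are arbitrary illustrative choices.

```python
import math


def score(x, theta):
    """Score T_theta(x) for the exponential density f_theta(x) = exp(-x/theta)/theta."""
    return (x - theta) / theta**2


def v_fun(x, theta):
    """V_theta(x) = T_theta(x)^2 + d/dtheta T_theta(x) for the same model."""
    return score(x, theta) ** 2 + 1.0 / theta**2 - 2.0 * x / theta**3


def expectation(g, theta, n_grid=200000, upper_mult=50.0):
    """Midpoint-rule approximation of E{g(X, theta)} for X exponential with mean theta,
    truncating the integral at upper_mult * theta."""
    step = upper_mult * theta / n_grid
    total = 0.0
    for k in range(n_grid):
        x = (k + 0.5) * step
        total += g(x, theta) * math.exp(-x / theta) / theta
    return total * step


if __name__ == "__main__":
    for theta in (0.5, 1.0, 2.0):
        # Both numbers should be numerically indistinguishable from zero.
        print(theta, expectation(score, theta), expectation(v_fun, theta))
```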

In Appendix A we argue that a good predictive random set $\mathcal{S}$ must have a support
with certain symmetry or balance properties with respect to the sampling distribution of
$T_{\theta_0}(X)$. In particular, let $\mathbb{B} = \{B_t : t \in \mathbb{T}\}$ be a generic collection of nested measurable
subsets of $\mathbb{T} = T_{\theta_0}(\mathbb{X})$. The collection $\mathbb{B}$ shall be called score-balanced if

$E_{X|\theta_0}\{T_{\theta_0}(X)\, I_{B_t}(T_{\theta_0}(X))\} = 0, \quad \forall\, t \in \mathbb{T}.$    (4.3)



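For a score-balanced collection $\mathbb{B} = \{B_t\}$ satisfying (4.3) we can define a
corresponding score-balanced predictive random set $\mathcal{S} = \mathcal{S}_{\mathbb{B}}$ as follows. Define the class
$\mathbb{S} = \{S_t : t \in \mathbb{T}\}$ of subsets of $\mathbb{U} = [0, 1]$ given by

$S_t = F_{\theta_0}(\{x : T_{\theta_0}(x) \in B_t\}).$

For simplicity, and without loss of generality, assume $\mathbb{S}$ contains $\emptyset$ and $\mathbb{U}$. Now take a
predictive random set $\mathcal{S}_{\mathbb{B}}$, supported on $\mathbb{S}$, such that its measure $P_{\mathcal{S}_{\mathbb{B}}}$ satisfies

$P_{\mathcal{S}_{\mathbb{B}}}\{\mathcal{S}_{\mathbb{B}} \subseteq K\} = \sup_{t :\, S_t \subseteq K} P_U(S_t), \quad K \subseteq [0, 1],$

where $P_U$ is the $\mathsf{Unif}(0, 1)$ measure. (The set $S_t$ is $P_U$-measurable for all $t$ by the assumed
measurability of $B_t$, $T_{\theta_0}$, and $F_{\theta_0}$.) The corresponding score-balanced belief function is

$\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}_{\mathbb{B}}) = P_{\mathcal{S}_{\mathbb{B}}}\{\mathcal{S}_{\mathbb{B}} \not\ni F_{\theta_0}(x)\}$
$\qquad = P_{X|\theta_0}\{B_{T_{\theta_0}(X)} \not\ni T_{\theta_0}(x)\}$
$\qquad = P_{X|\theta_0}\{T_{\theta_0}(X) \in B_{T_{\theta_0}(x)}\},$

where the last equality follows from the assumed nesting of $\{B_t\}$. Proposition 3 in
Appendix A shows that predictive random sets which are good in the sense that (4.2)
holds (at least locally) must be score-balanced.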

But there are many such $\mathcal{S}_{\mathbb{B}}$ to choose from, so we now consider finding a "best"
one. A reasonable definition of optimal score-balanced predictive random set is one that
makes the difference between the right- and left-hand sides of (4.2) as large as possible
for each $\theta$ in a neighborhood of $\theta_0$. Then, for two-sided assertions, we have the following definition.

Definition 4. Let $\mathbb{B}^\star = \{B_t^\star : t \in \mathbb{T}\}$ be such that, for each $t$,

$\int_{\{x :\, T_{\theta_0}(x) \in B_t^\star\}} V_{\theta_0}(x) f_{\theta_0}(x)\, dx$    (4.4)

is minimized subject to the score-balance constraint (4.3). Then $\mathcal{S}^\star = \mathcal{S}_{\mathbb{B}^\star}$ is the optimal
score-balanced predictive random set.

Here we give a general construction of an optimal score-balanced predictive random
set. Proving that this predictive random set satisfies the conditions of Definition 4 will
require assumptions about the model. Start with the following class of intervals:

$B_t^\star = (\xi_-(t), \xi_+(t)), \quad t \in T_{\theta_0}(\mathbb{X}),$    (4.5)

where the functions $\xi_\pm$ (which depend implicitly on $\theta_0$) are such that (4.3) holds. In
addition, we shall assume these functions are continuous and satisfy

• $\xi_-(t)$ is non-positive, $\xi_-(t) = t$ for $t \in (-\infty, 0)$ and is decreasing for $t \in [0, \infty)$;

• $\xi_+(t)$ is non-negative, $\xi_+(t) = t$ for $t \in [0, \infty)$ and is decreasing for $t \in (-\infty, 0)$.

The functions $\xi_\pm$ describe a sort of symmetry/balance in the distribution of $T_{\theta_0}(X)$:
they satisfy $\xi_+(\xi_-(t)) = t$ and $\xi_-(\xi_+(-t)) = -t$ for all $t > 0$. In some cases, for given $t$,
expressions for $\xi_-(t)$ and $\xi_+(t)$ can be found analytically, but typically numerical solutions
are required. Set $\mathcal{S}^\star = \mathcal{S}_{\mathbb{B}^\star}$. We claim that, under certain conditions on $V_{\theta_0}(x)$, $\mathcal{S}^\star$ is
optimal in the sense of Definition 4.

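Before we get to the optimality considerations, we first verify the assumption that
$\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}^\star) \sim \mathsf{Unif}(0, 1)$ under $P_{X|\theta_0}$. From the definition of $B_t^\star$, it is clear that

$T \in B_t^\star \iff \xi_-(t) < T < \xi_+(t)$
$\qquad \iff \xi_-(t) < \xi_-(T) \le \xi_+(T) < \xi_+(t)$
$\qquad \iff \xi_+(T) - \xi_-(T) < \xi_+(t) - \xi_-(t).$

Consequently, if $D_{\theta_0}(X) = \xi_+(T_{\theta_0}(X)) - \xi_-(T_{\theta_0}(X))$, then

$\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}^\star) = P_{X|\theta_0}\{T_{\theta_0}(X) \in B^\star_{T_{\theta_0}(x)}\} = P_{X|\theta_0}\{D_{\theta_0}(X) < D_{\theta_0}(x)\}.$

Therefore, since $D_{\theta_0}(X)$ is a continuous random variable, an argument like that in
Corollary 1 shows that $\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}^\star) \sim \mathsf{Unif}(0, 1)$ under $P_{X|\theta_0}$.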

We are now ready for optimality of $\mathcal{S}^\star$. Write $V(t)$ for $V_{\theta_0}(x)$, when treated as a
function of $t = T_{\theta_0}(x)$. The condition to be imposed is:

$V(t)$ is uniquely minimized at $t = 0$, and $V(0) < 0$.    (4.6)

This condition holds, e.g., for all exponential families with $\theta$ the natural parameter.

Proposition 2. Under condition (4.6), the score-balanced predictive random set
$\mathcal{S}^\star = \mathcal{S}_{\mathbb{B}^\star}$, with $\mathbb{B}^\star$ described above, is optimal in the sense of Definition 4.

Proof. The proof is simple but tedious so here we just sketch the main idea. Under
(4.6), the intervals $B_t^\star$, which are "balanced" around $T_{\theta_0}(x) = 0$, make most efficient use of
the space where $V_{\theta_0}(x)$ is smallest in the following sense. They are exactly the right size
to make $\mathcal{S}_{\mathbb{B}^\star}$ efficient, so any other efficient score-balanced predictive random set $\mathcal{S}_{\mathbb{B}}$ must
be determined by sets $\mathbb{B} = \{B_t\}$ other than intervals concentrated around $T_{\theta_0}(x) = 0$.
Since such intervals are where $V_{\theta_0}(x)$ is smallest, the integral in (4.4) corresponding to
$B_t$ must be larger than that corresponding to $B_t^\star$. Therefore, $\mathcal{S}^\star$ satisfies the conditions
of Definition 4 and, hence, is optimal. □

Unfortunately, (4.6) is not always satisfied. For example, it can fail for exponential
families not in natural form. But we claim that (4.6) is not absolutely essential. Assume
$V(t)$ is convex and $V(0) < 0$. This relaxed assumption holds, e.g., for all exponential
families. To keep things simple, suppose that $V(t)$ is minimized at $\hat t > 0$. Although
the argument to be given is general, Figure 2(a) illustrates the phenomenon for the
exponential distribution with mean $\theta_0 = 1$. The heavy line there represents $V(t)$, and
the thin lines represent $t\,h(t)$ (black) and $V(t)h(t)$ (gray), where $h(t)$ is the density of $T$.
The horizontal lines represent the intervals $B_t^\star$ in (4.5) for select $t$. By convexity of $V(t)$,
there exists $t_0$ such that $\hat t \in (0, t_0)$ and $V(t) < V(0)$ for each $t \in (0, t_0)$; this is (0, 0.5)
in the figure. For $t \in (0, t_0)$, the intervals $B_t^\star$ do not contain $(0, t_0)$; these intervals are
shown in black. In such cases, the integral (4.4) can be reduced by breaking $B_t^\star$ into two
parts: one part takes more of $(0, t_0)$, where $V(t)$ is smallest, and the other part is chosen
to satisfy the score-balance condition (4.3). But when $t > t_0$, no improvement can be
made by changing $B_t^\star$; these cases are shown in gray. So, in this sense, the intervals $B_t^\star$
in (4.5) are not too bad even if (4.6) fails.

On the other hand, violations of (4.6) are due to the choice of the parametrization.
Indeed, under mild assumptions, there exists a transformation $\eta = \eta(\theta)$ such that the
corresponding $V(t)$ function for $\eta$ satisfies (4.6). Then the predictive random set $\mathcal{S}^\star$ in
Proposition 2 is optimal for this transformed problem.

Gaussian Example (cont). This is a natural exponential family distribution, so
Proposition 2 holds, and $\mathcal{S}^\star$ is the optimal score-balanced predictive random set. Here the score
function is $T_\theta(x) = x - \theta$. Under $X \sim \mathsf{N}(\theta, 1)$, the distribution of $T_\theta(X)$ is symmetric
about 0. Therefore, $B_t^\star = (-|t|, |t|)$, and the corresponding predictive random set is
supported on subsets $S_t$ given by

$S_t = F_{\theta_0}(\{x : |x - \theta_0| < |t|\}) = (\Phi(-|t|), \Phi(|t|)),$

with belief function $\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}^\star) = 2\Phi(|x - \theta_0|) - 1$. This is exactly one minus the
plausibility function in (2.12) based on the default predictive random set (2.7). Therefore,
we conclude that (2.7) is, in fact, the optimal score-balanced predictive random set in
the Gaussian problem. This is consistent with our intuition, given that the results based
on this default choice in the Gaussian example match up with good classical results.

Figure 2: Specifics of Panel (a) are discussed in the text. Panel (b) shows $\mathsf{pl}_x(\theta; \mathcal{S})$, as a
function of the exponential scale parameter $\theta$, for two predictive random sets $\mathcal{S}$: optimal
score-balanced (black) and default (gray). Vertical line marks the observed $X = 5$.

Exponential Example. Suppose $X$ is an exponential random variable with mean $\theta$, as
discussed above. Unlike the Gaussian, this distribution is asymmetric, so, for the optimal
score-balanced IM, a numerical method is needed to identify the set $B^\star_{T_{\theta_0}(x)}$ for each
observed $x$. Plots of the corresponding plausibility functions $\mathsf{pl}_x(\theta; \mathcal{S}) = 1 - \mathsf{bel}_x(\{\theta\}^c; \mathcal{S})$
for two different predictive random sets based on $X = 5$ are shown in Figure 2(b). The
black line is based on the optimal score-balanced predictive random set, and the gray
line is based on the default predictive random set in (2.7). The 90% plausibility intervals,
determined by the horizontal line at $\alpha = 0.1$, are much shorter for the score-balanced IM
compared to the default in this case. For comparison, one might consider a crude nominal
90% confidence interval for $\theta$, namely, $(X e^{-1.65}, X e^{1.65})$, based on a variance-stabilizing
transformation and normal approximation. These intervals tend to be shorter than both
plausibility intervals, but their coverage probability ($\approx 0.82$) is too small.

Poisson Example (cont). Although the theory above holds only for continuous models,
the score-balanced predictive random set performs well in discrete problems too. For the
sake of space, we refer the reader to Martin et al. (2012) for the details.



5 Two more examples 

5.1 A standardized mean problem 



Suppose that $X_1, \ldots, X_n$ are independent $\mathsf{N}(\mu, \sigma^2)$ observations. The goal is to make
inference on $\psi = \mu/\sigma$, the standardized mean, or signal-to-noise ratio. Following Dempster
(1963), we start with a reduction of the full data to the sufficient statistics for $\theta = (\mu, \sigma^2)$,
namely $(\bar X, S^2)$, the sample mean and variance. Formal IM-based justification for this
reduction is available, though we shall not discuss this here.

For the A-step, we take the association to be

$\bar X = \mu + n^{-1/2} \sigma U_1$ and $S = \sigma U_2,$    (5.1)

where $U = (U_1, U_2) \sim P_U = \mathsf{N}(0, 1) \times \{\mathsf{ChiSq}(n - 1)/(n - 1)\}^{1/2}$. After replacing $\sigma$ in the
left-most identity in (5.1) with $S/U_2$, a bit of algebra reveals that

$n^{1/2} \bar X / S = (n^{1/2} \psi + U_1)/U_2$ and $S = \sigma U_2.$

For $\theta = (\psi, \sigma)$, make a change of auxiliary variable $v = \varphi_\theta(u)$, given by

$v_1 = F_\psi\!\left(\frac{n^{1/2}\psi + u_1}{u_2}\right)$ and $v_2 = \frac{\exp(u_2)}{1 + \exp(u_2)},$

where $F_\psi$ is the distribution function for $\mathsf{t}_{n-1}(n^{1/2}\psi)$, a non-central Student-t distribution
with $n - 1$ degrees of freedom and non-centrality parameter $n^{1/2}\psi$. Note that the full
generality of the parameter-dependent change-of-variables in Corollary 2 is needed here.
Then the transformed association is

$n^{1/2} \bar X / S = F_\psi^{-1}(V_1)$ and $S = \sigma \log\{V_2/(1 - V_2)\},$

and the measure $P_V$ on the space of $V = (V_1, V_2)$ has a $\mathsf{Unif}(0, 1)$ marginal on the $V_1$-space;
the distribution on $V_1$-slices of the $V_2$-space can be worked out, but it is not needed in
what follows. For the P-step, we predict $v^\star = \varphi_\theta(u^\star)$ with a rectangle predictive random
set $\mathcal{S}$ defined by the following set-valued mapping, similar to (2.7):

$v = (v_1, v_2) \mapsto \{v_1' : |v_1' - 0.5| < |v_1 - 0.5|\} \times [0, 1].$    (5.2)

Optimality considerations along the lines in Section 4.3.2 could be pursued here, but we
choose to keep things simple since analysis of the non-central Student-t distribution is
non-trivial. An important direction of future research is to develop numerical methods
for evaluating optimal IMs. Using a predictive random set that spans the entire $v_2$-space
for each $v$ has the effect of "integrating out" the nuisance parameter $\sigma$. For the
predictive random set $\mathcal{S}$ in (5.2), if $z = n^{1/2} \bar x / s$, then the C-step gives the following set
$\Theta_x(\mathcal{S}) = \Psi_x(\mathcal{S}) \times \Sigma_x(\mathcal{S})$ of candidate $(\psi, \sigma)$ pairs:

$\{\psi : |F_\psi(z) - 0.5| < |V_1 - 0.5|\} \times \{\sigma : \sigma > 0\}, \quad V \sim P_V.$    (5.3)

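For assertions $A = \{(\psi, \sigma) : \sigma > 0\}$, with $\psi$ fixed, the plausibility function is given by

$\mathsf{pl}_x(A) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \not\subseteq A^c\} = P_{\mathcal{S}}\{\Psi_x(\mathcal{S}) \ni \psi\} = 1 - |2F_\psi(z) - 1|.$

In this case, the $100(1 - \alpha)\%$ plausibility interval $\Pi_x(\alpha)$ for $\psi$ is obtained by inverting
the inequality $1 - |2F_\psi(z) - 1| > \alpha$, i.e., $\Pi_x(\alpha) = \{\psi : \alpha/2 < F_\psi(z) < 1 - \alpha/2\}$.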
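
The formula $1 - |2F_\psi(z) - 1|$ requires the non-central t distribution function. The sketch below avoids any special-function library by estimating $F_\psi(z)$ through simulation from the association itself, $n^{1/2}\bar X/S = (n^{1/2}\psi + U_1)/U_2$. This Monte Carlo shortcut, the simulation size, the fixed seed, and the function names are assumptions for illustration; an exact non-central t routine would normally be preferred.

```python
import math
import random


def f_psi(z, psi, n, n_draws=20000, seed=3):
    """Monte Carlo estimate of F_psi(z) = P{(sqrt(n) psi + U1) / U2 <= z}, where
    U1 ~ N(0, 1) and U2 = sqrt(ChiSq(n - 1) / (n - 1)) are independent."""
    rng = random.Random(seed)  # fixed seed gives common random numbers across psi
    hits = 0
    for _ in range(n_draws):
        u1 = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # a ChiSq(n - 1) draw
        u2 = math.sqrt(chi2 / (n - 1))
        hits += (math.sqrt(n) * psi + u1) / u2 <= z
    return hits / n_draws


def pl_psi(xbar, s, psi, n):
    """Plausibility of the cylinder assertion {psi} x (0, inf) given (xbar, s)."""
    z = math.sqrt(n) * xbar / s
    return 1.0 - abs(2.0 * f_psi(z, psi, n) - 1.0)


if __name__ == "__main__":
    for psi in (0.0, 0.25, 0.5, 1.0, 2.0):
        print(psi, pl_psi(xbar=0.5, s=1.0, psi=psi, n=10))
```

Reusing the same seed for every $\psi$ makes the estimated $F_\psi(z)$ monotone in $\psi$, so the set of $\psi$ values with plausibility above $\alpha$ can be located by a simple grid or root search.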

This is exactly the usual frequentist confidence interval based on the sampling
distribution of the standardized sample mean; it also agrees with the fiducial intervals obtained
by Dempster (1963) and Dawid and Stone (1982). The standard frequentist approach
relies on an informal "plug-in style" marginalization, whereas the IM approach above shows
exactly how $\sigma$ is ignored via cylinder assertions. More sophisticated IM marginalization
techniques are available, but we do not discuss these here.



5.2 A many-exponential-rates problem 

For our last example, we consider a high-dimensional problem. Suppose that
$X = (X_1, \ldots, X_n)$ consists of independent observations $X_i \sim \mathsf{Exp}(\theta_i)$, $i = 1, \ldots, n$, with
unknown rates $\theta_1, \ldots, \theta_n$. The goal is to give a probabilistic measure of the support in
$X = x$ for the assertion $A = \{\theta_1 = \cdots = \theta_n\}$ that the rates are equal. A version of this
problem was also discussed in Martin et al. (2010), but here we simplify the presentation,
emphasize the three-step IM construction, and produce much better results.

Start, in the A-step, with the association $X_i = U_i/\theta_i$, $i = 1, \ldots, n$, where $P_U$ is the
product measure $\mathsf{Exp}(1)^{\times n}$. Make a change of auxiliary variables $v = \varphi(u)$:

$v_0 = \sum_{i=1}^n u_i$ and $v_i = u_i/v_0, \quad i = 1, \ldots, n.$

The new vector $v = (v_0, v_1, \ldots, v_n)$ takes values in $\mathbb{V} = (0, \infty) \times \mathbb{P}_{n-1}$, where $\mathbb{P}_{n-1}$ is the
$(n - 1)$-dimensional probability simplex in $\mathbb{R}^n$, and $P_V = P_U \varphi^{-1}$ is the product measure
$\mathsf{Gamma}(n, 1) \times \mathsf{Dir}_n(1_n)$. Then the modified association is

$X_i = V_0 V_i/\theta_i, \quad i = 1, \ldots, n,$ where $V = (V_0, V_1, \ldots, V_n) \sim P_V.$    (5.4)

For the P-step, we shall consider the following predictive random set $\mathcal{S}$ characterized by
$V \sim P_V$ and the set-valued mapping $v \mapsto \{v' : h(v') < h(v)\}$. In this case, we take

$h(v) = -\sum_{i=1}^{n-1} [a_i \log t_i(v) + b_i \log\{1 - t_i(v)\}],$

with $t_i(v) = \sum_{j=1}^{i} v_j$, $a_i = 1/(n - i - 0.3)$, and $b_i = 1/(i - 0.3)$. A few remarks on this
choice of $\mathcal{S}$ are in order. First, it follows from Corollary 1 that $\mathcal{S}$ is efficient. Second, the
random vector $(t_1(V), \ldots, t_{n-1}(V))$, for $V \sim P_V$, has the distribution of a vector of $n - 1$
sorted $\mathsf{Unif}(0, 1)$ random variables, and Zhang (2010, Sec. 3.4.2) shows that $\mathcal{S}$ provides an
easy-to-compute alternative to the well-performing hierarchical predictive random set for
predicting sorted uniforms used in Martin et al. (2010). Finally, that the first component
$v_0$ of $v$ is essentially ignored in $\mathcal{S}$ is partly for convenience, and partly because $v_0$ is related
to the overall scale of the problem which is irrelevant to the assertion $A$ of interest.

For the C-step, combining the observed data, the association model (5.4), and the
predictive random set $\mathcal{S}$ above, we get the following random set for $\theta$:

$\Theta_x(\mathcal{S}) = \{\theta : h(v(x, \theta)) < h(V)\}, \quad V \sim P_V,$

where $v(x, \theta) = (\theta_1 x_1, \ldots, \theta_n x_n) / \sum_{j=1}^n \theta_j x_j$. Since the assertion $A = \{\theta_1 = \cdots = \theta_n\}$ is a
one-dimensional subset of $\Theta$, the belief function is zero. It is also important to note that
when $\theta$ is a constant vector, $v(x, \theta)$ is independent of that constant, i.e., $v(x, \theta) = v(x, 1_n)$,
which greatly simplifies computation of the plausibility function at $A$. Indeed,

$\mathsf{pl}_x(A) = P_V\{h(V) > h(v(x, 1_n))\},$

which can easily be evaluated using Monte Carlo. As described in Section 3.4, the level
$\alpha$ IM-based test rejects the assertion $A$ if and only if $\mathsf{pl}_x(A) \le \alpha$.
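
A minimal Monte Carlo sketch of this plausibility computation follows. It assumes the weights $a_i = 1/(n - i - 0.3)$ and $b_i = 1/(i - 0.3)$ and the cumulative sums $t_i(v) = v_1 + \cdots + v_i$ described in the P-step, and it simulates the simplex part of $V$ by normalizing independent $\mathsf{Exp}(1)$ draws. The function names, seed, and number of draws are hypothetical.

```python
import math
import random


def h(w):
    """h(v) = -sum_{i=1}^{n-1} [a_i log t_i + b_i log(1 - t_i)], where v = w / sum(w),
    t_i = v_1 + ... + v_i, a_i = 1/(n - i - 0.3) and b_i = 1/(i - 0.3).
    The input w must have positive entries; it is normalized internally."""
    n = len(w)
    total = sum(w)
    tail = [0.0] * (n + 1)              # tail[i] = w[i] + ... + w[n-1]
    for i in range(n - 1, -1, -1):
        tail[i] = tail[i + 1] + w[i]
    value, head = 0.0, 0.0
    for i in range(1, n):
        head += w[i - 1]                # head = w[0] + ... + w[i-1]
        a_i = 1.0 / (n - i - 0.3)
        b_i = 1.0 / (i - 0.3)
        value -= a_i * math.log(head / total) + b_i * math.log(tail[i] / total)
    return value


def pl_equal_rates(x, n_draws=2000, seed=4):
    """Monte Carlo estimate of pl_x(A) = P{h(V) > h(v(x, 1_n))}."""
    rng = random.Random(seed)
    h_obs = h(x)                        # v(x, 1_n) = x / sum(x)
    hits = 0
    for _ in range(n_draws):
        u = [rng.expovariate(1.0) for _ in x]
        hits += h(u) > h_obs
    return hits / n_draws


if __name__ == "__main__":
    print(pl_equal_rates([1.0] * 20))                  # evenly spread data
    print(pl_equal_rates([1.0] * 10 + [1000.0] * 10))  # two very different groups
```

Computing $1 - t_i$ from a separate tail sum, rather than by subtraction, avoids taking the logarithm of a rounded zero when the last few components are tiny.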

For illustration, we compare our results with those of Martin et al. (2010). They
consider the basic likelihood ratio test, which is based on the test statistic $\{(\prod_{i=1}^n X_i)^{1/n} / \bar X\}^n$.
They also consider a different sort of IM solution, based on thresholding the plausibility
function, but with a default type of predictive random set that uses a Kullback-Leibler
neighborhood for predicting the component $(V_1, \ldots, V_n)$ of $V$. We compare the power of
these three tests in several different cases. In each setup, $n = n_1 + n_2 = 100$ observations
are available, but the first $n_1$ exponential rates equal 1 while the last $n_2$ equal $\theta$. Figure 3
shows the power functions over a range of $\theta$ values for two configurations of $(n_1, n_2)$.
Here we see that, in both cases, the likelihood ratio and old IM tests have similar power,
possibly because of the common connection to the Kullback-Leibler divergence. On the
other hand, the new IM-based test presented above has strikingly larger power than the
other two. This substantial improvement in power is likely due to the close relationship
between our choice of $\mathcal{S}$ and the assertion of interest. So while the comparison between
the new IM results and those of the other "default" methods is not entirely fair, it is
interesting to see that an assertion-specific choice of predictive random set can lead to
drastically improved performance.

Figure 3: Estimated powers of the likelihood ratio and two IM-based tests for the
simulation described in Section 5.2. Panel (a): $(n_1, n_2) = (50, 50)$. Panel (b): $(n_1, n_2) = (10, 90)$.
Here $\theta$ is the ratio of the rate of the last $n_2$ observations to that of the first $n_1$.

6 Discussion 

The conversion of experience to knowledge is fundamental to the advancement of science, 
and statistical inference plays a crucial role. For ages, there has been disagreement about 
which statistical paradigm to choose. Both the frequentist and Bayesian paradigms have 
their own set of advantages and disadvantages, so it would be worthwhile to identify some- 
thing new which combines the respective advantages but loses, or at least weakens, the 
disadvantages. Here we have described a three-step procedure to construct IMs for prior- 
free, post-data probabilistic inference, and proved that IMs yield frequency-calibrated 
probabilities under very general conditions. The point is that the values of the
corresponding belief/plausibility function are meaningful both within and across experiments,
accomplishing both the frequentist and Bayesian goals simultaneously.

The proposed IM approach is surely new, but since new is not always better, it is
natural to ask what is the benefit of using IMs. Our response is that, although it will
take time for users to familiarize themselves with the thought process, the IM framework is
logical, intuitive, and able to produce meaningful and frequency-calibrated probabilistic
measures of uncertainty about $\theta$ without a prior distribution. The latter property is
something that no other inferential framework is able to achieve.

Admittedly, the final IM depends on the user's choice of association and predictive
random set, but we do not believe that this is particularly damning. Section 4 laid the
foundation for a theory of optimal predictive random sets, and further efforts to develop
"default" predictive random sets are ongoing, particularly for multi-parameter problems.
But a case can be made to prefer the ambiguity of the choice of predictive random set over
that of a frequentist's choice of statistic or Bayesian's choice of prior. The point is that
neither a frequentist sampling distribution nor a Bayesian prior distribution adequately
describes the source of uncertainty about $\theta$. As we argued above, this uncertainty is
fully characterized by the fact that, whatever the association, the value of $u^\star$ is missing.
Therefore, it seems only natural to prefer the IM framework that features a direct attack
on the source of uncertainty over another that attacks the problem indirectly. Moreover,
as was demonstrated in Section 5.2, choosing the predictive random set that depends on
the problem and/or assertion of interest can lead to drastically improved results.

We note that differences between IM outputs from different predictive random sets are
slight for assertions involving one-dimensional quantities. However, for high-dimensional
auxiliary variables, the choice of predictive random set deserves special attention. In
such cases, our approach is to construct predictive random sets for functions of auxiliary
variables that are most relevant to the assertions of interest. This leads to a practically
useful auxiliary variable dimension reduction. It is interesting that this approach has
some close connections to Fisher's theory of sufficient statistics (Martin and Liu 2012).
For nuisance parameter problems, like those in Section 5, there is a different form of
dimension reduction required (Martin and Liu 2013).

Of course, compared to the well-developed Bayesian and frequentist methods, IMs
have many open problems. Both theoretical work and applications have shown that
the IM framework is promising. Given the attractive properties of IMs developed here
and in the references above, we expect to see more exciting advancements in IMs or
new inferential frameworks (e.g., Martin 2012) that are probabilistic and have desirable
frequency properties.

A Details from Section 4.3.2



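If we assume that $\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}) \sim \mathsf{Unif}(0,1)$ under
$\mathsf{P}_{X|\theta_0}$, then there exists a collection of measurable subsets
$\mathbb{X}(\alpha) \subset \mathbb{X}$, depending implicitly on $\theta_0$ and $\mathcal{S}$, such that,
for each $\alpha$, $\mathsf{P}_{X|\theta_0}\{\mathbb{X}(\alpha)\} = \alpha$ and
$\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}) \le \alpha$ iff $x \in \mathbb{X}(\alpha)$. It follows that,
for any $\theta$,

\[ \mathsf{P}_{X|\theta}\{\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}) \le \alpha\} = \psi_\alpha(\theta) := \int_{\mathbb{X}(\alpha)} f_\theta(x)\,dx. \]

By definition, $\psi_\alpha(\theta_0) = \alpha$. Now, (4.2) is equivalent to
$\psi_\alpha(\theta) \le \psi_\alpha(\theta_0)$ for all $\alpha$, or, to put it another way,
$\psi_\alpha(\theta)$ is maximized at $\theta = \theta_0$ for all $\alpha$. Under the stated regularity
conditions, this maximization is equivalent to the claim that, for all $\alpha \in (0,1)$, the first
and second derivatives of $\psi_\alpha(\theta)$ at $\theta = \theta_0$ satisfy

\[ \psi_\alpha'(\theta_0) = \int_{\mathbb{X}(\alpha)} T_{\theta_0}(x) f_{\theta_0}(x)\,dx = 0, \tag{A.1} \]

\[ \psi_\alpha''(\theta_0) = \int_{\mathbb{X}(\alpha)} V_{\theta_0}(x) f_{\theta_0}(x)\,dx < 0. \tag{A.2} \]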

Since $T_{\theta_0}(X)$ has mean zero under $\mathsf{P}_{X|\theta_0}$, we can see that (A.1) requires
$\mathbb{X}(\alpha)$ to be somehow symmetric, or balanced, with respect to the distribution of
$T_{\theta_0}(X)$. We, therefore, refer to (A.1) as the score-balance condition. This condition,
expressed in terms of $\mathbb{X}(\alpha)$ in (A.1), can be traced back to a corresponding condition
on the predictive random set.
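
As a concrete check of (A.1) and (A.2), consider a single observation $X \sim \mathsf{N}(\theta, 1)$,
a model chosen here purely for illustration and not taken from the argument above. The score
is $T_{\theta_0}(x) = x - \theta_0$, and (A.2) requires $V_{\theta_0} f_{\theta_0}$ to be the second
derivative of $f_\theta$ at $\theta_0$, which here gives $V_{\theta_0}(x) = (x - \theta_0)^2 - 1$.
Taking $\mathbb{X}(\alpha) = \{x : |x - \theta_0| \le t(\alpha)\}$ with
$t(\alpha) = \Phi^{-1}((1+\alpha)/2)$ gives a symmetric, hence score-balanced, set. The Python
sketch below (all function names are hypothetical) evaluates $\psi_\alpha$ in closed form and
approximates its derivatives at $\theta_0$ by finite differences.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal; the N(theta, 1) model is an illustrative assumption


def t_alpha(alpha):
    """Half-width t(alpha) such that P_{theta0}{|X - theta0| <= t(alpha)} = alpha."""
    return STD.inv_cdf((1 + alpha) / 2)


def psi(alpha, theta, theta0=0.0):
    """psi_alpha(theta) = P_theta{|X - theta0| <= t(alpha)} for X ~ N(theta, 1)."""
    t = t_alpha(alpha)
    shift = theta - theta0
    return STD.cdf(t - shift) - STD.cdf(-t - shift)


def derivatives(alpha, theta0=0.0, h=1e-3):
    """Central finite-difference first and second derivatives of psi_alpha at theta0."""
    up = psi(alpha, theta0 + h, theta0)
    mid = psi(alpha, theta0, theta0)
    down = psi(alpha, theta0 - h, theta0)
    return (up - down) / (2 * h), (up - 2 * mid + down) / h**2


for alpha in (0.1, 0.5, 0.9):
    d1, d2 = derivatives(alpha)
    t = t_alpha(alpha)
    # Integral of (x^2 - 1) * phi(x) over [-t, t], i.e. the right-hand side of (A.2)
    closed_form = -2 * t * STD.pdf(t)
    print(f"alpha={alpha}: psi'={d1:.2e}, psi''={d2:.4f}, closed form={closed_form:.4f}")
```

In this example the second derivative equals $-2\,t(\alpha)\,\phi(t(\alpha))$, which is negative for
every $\alpha \in (0,1)$, so the symmetric sets satisfy both conditions.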

Let us now assume that $\mathcal{S}_B$ is such that
$\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}_B) \sim \mathsf{Unif}(0,1)$ under $\mathsf{P}_{X|\theta_0}$; in
the main text we construct a particular score-balanced predictive random set and show
that this assumption holds. Then, as we argued above, for any $\alpha \in (0,1)$, there
exists $t(\alpha) \in \mathbb{T}$ such that $\mathsf{bel}_x(\{\theta_0\}^c; \mathcal{S}_B) \le \alpha$ iff
$T_{\theta_0}(x) \in B_{t(\alpha)}$. In this case, for any $\theta$,

\[ \mathsf{P}_{X|\theta}\{\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}_B) \le \alpha\} = \int_{T_{\theta_0}(x) \in B_{t(\alpha)}} f_\theta(x)\,dx, \]

and the right-hand side is $\psi_\alpha(\theta)$ as defined previously. From the definition of $B$,
differentiating under the integral sign reveals that (A.1) holds. We can now prove

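Proposition 3. Focus on predictive random sets $\mathcal{S}$ such that
$\mathsf{bel}_X(\{\theta_0\}^c; \mathcal{S}) \sim \mathsf{Unif}(0,1)$ under $\mathsf{P}_{X|\theta_0}$. Then
condition (4.2) holds for all $\theta$ in a neighborhood of $\theta_0$ iff the predictive
random set $\mathcal{S} = \mathcal{S}_B$ is score-balanced and

\[ \int_{T_{\theta_0}(x) \in B_t} V_{\theta_0}(x) f_{\theta_0}(x)\,dx < 0, \quad \forall\, t \in \mathbb{T}. \tag{A.3} \]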

Proof. Take $\theta$ close enough to $\theta_0$ such that the remainder terms in a second-order
Taylor approximation of $\psi_\alpha(\theta)$ about $\theta = \theta_0$ can be ignored. That is, for
any $\alpha$,

\[ \psi_\alpha(\theta) - \psi_\alpha(\theta_0) = \int_{T_{\theta_0}(x) \in B_{t(\alpha)}} T_{\theta_0}(x) f_{\theta_0}(x)\,dx \cdot (\theta - \theta_0) + \frac{1}{2} \int_{T_{\theta_0}(x) \in B_{t(\alpha)}} V_{\theta_0}(x) f_{\theta_0}(x)\,dx \cdot (\theta - \theta_0)^2. \]

The first term vanishes and the second term is negative by (A.3). Therefore
$\psi_\alpha(\theta) \le \psi_\alpha(\theta_0)$ for all $\alpha$ and, hence, (4.2) holds for all $\theta$
in a neighborhood of $\theta_0$. □

B Corrections (added post-publication)



B.1 Correction of Theorem 1

In the main text, for validity of the predictive random set $\mathcal{S}$, the support $\mathbb{S}$
was assumed only to be nested, i.e., for any $S, S' \in \mathbb{S}$, either $S \subseteq S'$ or
$S' \subseteq S$. However, some additional technical conditions are required for the proof to
go through.

Fix a topology on the auxiliary variable space $\mathbb{U}$, and let the $\sigma$-algebra defined there
contain all the open sets. In addition to being nested, we shall assume that $\mathbb{S}$ contains
both $\varnothing$ and $\mathbb{U}$, and that all of its contents are closed subsets of $\mathbb{U}$. These
additional requirements result in no real loss of generality. Indeed, those predictive random
sets in Corollary 1 of the main text already satisfy these. These extra conditions also make the
statement and proof of the theorem more transparent.

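Theorem 1'. Let $\mathbb{S}$ be a nested collection of closed $\mathsf{P}_U$-measurable subsets of
$\mathbb{U}$ that contains $\varnothing$ and $\mathbb{U}$. Define a predictive random set $\mathcal{S}$, with
distribution $\mathsf{P}_{\mathcal{S}}$, supported on $\mathbb{S}$, such that

\[ \mathsf{P}_{\mathcal{S}}\{\mathcal{S} \subseteq K\} = \sup_{S \in \mathbb{S}:\, S \subseteq K} \mathsf{P}_U(S), \quad K \subseteq \mathbb{U}. \]

Then $\mathcal{S}$ is valid in the sense of Definition 1 in the main text.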

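Proof. Set $Q(u) = \mathsf{P}_{\mathcal{S}}\{\mathcal{S} \not\ni u\}$. For any $\alpha \in (0,1)$, let
$S_\alpha$ be the smallest $S \in \mathbb{S}$ such that
$\mathsf{P}_{\mathcal{S}}\{\mathcal{S} \subseteq S\} = \mathsf{P}_U(S) \ge 1 - \alpha$. In particular,
$S_\alpha = \bigcap\{S \in \mathbb{S} : \mathsf{P}_U(S) \ge 1 - \alpha\}$. Since
each $S$ is closed, so is $S_\alpha$; it is also measurable by our assumptions about the richness of
the $\sigma$-algebra on $\mathbb{U}$. The key observation is that $Q(u) > 1 - \alpha$ iff
$u \in S_\alpha^c$. Therefore, by continuity of $\mathsf{P}_U$ from above, we get

\[ \mathsf{P}_U\{Q(U) > 1 - \alpha\} = \mathsf{P}_U(S_\alpha^c) = 1 - \mathsf{P}_U(S_\alpha) = 1 - \lim \mathsf{P}_U(S), \]

where the limit is over all $S$ decreasing to $S_\alpha$. By construction, each such $S$ satisfies
$\mathsf{P}_U(S) \ge 1 - \alpha$. So, finally, we get $\mathsf{P}_U\{Q(U) > 1 - \alpha\} \le \alpha$ and,
since $\alpha$ is arbitrary, the claimed validity is proved. □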
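
To see the validity property numerically, the sketch below takes $\mathsf{P}_U = \mathsf{Unif}(0,1)$
and the nested support of closed intervals $\{u : |u - 0.5| \le r\}$, $r \in [0, 0.5]$; this
particular choice is an assumption made for illustration, and the function names are
hypothetical. With the distribution of the random set defined as in the theorem above, the
interval with radius $r$ has probability $2r$, so $Q(u) = |2u - 1|$ and $Q(U)$ is itself uniform.
The Monte Carlo estimate of $\mathsf{P}_U\{Q(U) > 1 - \alpha\}$ should therefore sit close to
$\alpha$.

```python
import random


def Q(u):
    """Q(u) = P_S{S does not contain u} for S = {u' : |u' - 0.5| <= |U' - 0.5|},
    with U' ~ Unif(0, 1); since |U' - 0.5| ~ Unif(0, 0.5), this equals |2u - 1|."""
    return abs(2 * u - 1)


def exceedance_rate(alpha, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P_U{Q(U) > 1 - alpha} with U ~ Unif(0, 1)."""
    rng = random.Random(seed)
    hits = sum(Q(rng.random()) > 1 - alpha for _ in range(n_draws))
    return hits / n_draws


for alpha in (0.05, 0.10, 0.50):
    print(f"alpha={alpha}: estimated exceedance = {exceedance_rate(alpha):.4f}")
```

A non-nested or overly small random set would show up here as an estimate noticeably above
$\alpha$, which is the failure that the validity condition rules out.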

B.2 Correction/extension of Theorem 3 

Theorem 3 in the main text says that nested predictive random sets are more efficient 
than those which are not nested. However, the nested predictive random set constructed 
in that theorem is not necessarily valid. Since validity is a key to the IM analysis, it would 
be desirable if the new nested predictive random set S' was also valid. We accomplish 
this in Theorem 3' below. First, we need the following lemma. 

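Lemma 0. On a space $\mathbb{U}$ equipped with probability $\mathsf{P}_U$, let $\mathcal{S}$ be a valid
predictive random set for $U \sim \mathsf{P}_U$. Choose a collection of $\mathsf{P}_U$-measurable subsets
$\{\mathbb{U}_x : x \in \mathbb{X}\}$ of $\mathbb{U}$, and set
$\eta(x) = \mathsf{P}_{\mathcal{S}}\{\mathcal{S} \subseteq \mathbb{U}_x\}$. Then

\[ \inf_{x \in \mathbb{X}_0} \eta(x) \le \mathsf{P}_U\Bigl\{ \bigcap_{x \in \mathbb{X}_0} \mathbb{U}_x \Bigr\} \]

for any subset $\mathbb{X}_0$ of $\mathbb{X}$ such that $\bigcap_{x \in \mathbb{X}_0} \mathbb{U}_x$ is
$\mathsf{P}_U$-measurable.

Proof. First, note that if $u \notin \mathbb{U}_x$, then
$Q(u) = \mathsf{P}_{\mathcal{S}}\{\mathcal{S} \not\ni u\} \ge \eta(x)$. Therefore, if
$u \in \bigcup_{x \in \mathbb{X}_0} \mathbb{U}_x^c$, then $Q(u) \ge \inf_{x \in \mathbb{X}_0} \eta(x)$.
This argument implies

\[ \mathsf{P}_U\Bigl\{Q(U) \ge \inf_{x \in \mathbb{X}_0} \eta(x)\Bigr\} \ge \mathsf{P}_U\Bigl\{ \bigcup_{x \in \mathbb{X}_0} \mathbb{U}_x^c \Bigr\} = 1 - \mathsf{P}_U\Bigl\{ \bigcap_{x \in \mathbb{X}_0} \mathbb{U}_x \Bigr\}. \]

Since $\mathcal{S}$ is valid, we have

\[ \mathsf{P}_U\Bigl\{Q(U) \ge \inf_{x \in \mathbb{X}_0} \eta(x)\Bigr\} \le 1 - \inf_{x \in \mathbb{X}_0} \eta(x). \]

Combining this with the inequality in the previous display, we get

\[ 1 - \inf_{x \in \mathbb{X}_0} \eta(x) \ge 1 - \mathsf{P}_U\Bigl\{ \bigcap_{x \in \mathbb{X}_0} \mathbb{U}_x \Bigr\}, \]

which implies $\inf_{x \in \mathbb{X}_0} \eta(x) \le \mathsf{P}_U\{\bigcap_{x \in \mathbb{X}_0} \mathbb{U}_x\}$. □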

A measurability question was overlooked in the main text. In particular, the sets in
(B.1) below are not automatically measurable. To ensure measurability, we shall add one more
modification; note that this is not needed if the sampling model $\mathsf{P}_{X|\theta}$ is discrete. To
start, for the given topology on $\mathbb{U}$, keep the same assumptions about the corresponding
$\sigma$-algebra as above. Now, recall the a-events
$\mathbb{U}_x(A) = \{u \in \mathbb{U} : \Theta_x(u) \subseteq A\}$ defined
in the proof of Proposition 1 in the main text. Here we shall replace $\mathbb{U}_x(A)$ with its
closure. This does not affect any properties of the resulting belief function when $\mathsf{P}_U$ is
non-atomic. In all the examples we have considered, $\mathsf{P}_U$ can be taken as continuous; this
is a particularly convenient choice, in light of Corollary 1 in the main text.

Theorem 3'. Suppose that either $\mathbb{X}$ is a discrete space, or that the assumptions in the
previous paragraph hold. Fix $A \subseteq \Theta$ and assume condition (2.10) in the main text. Given
any valid predictive random set $\mathcal{S}$, there exists a nested and valid predictive random set
$\mathcal{S}'$ such that $\mathsf{bel}_x(A; \mathcal{S}') \ge \mathsf{bel}_x(A; \mathcal{S})$ for each
$x \in \mathbb{X}$.

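Proof. For the given $A$ and $\mathcal{S}$, set $b(x) = \mathsf{bel}_x(A; \mathcal{S})$. Define a
collection $\mathbb{S}' = \{S'_x : x \in \mathbb{X}\}$ of subsets of $\mathbb{U}$ as follows:

\[ S'_x = \bigcap_{y \in \mathbb{X}:\, b(y) \ge b(x)} \mathbb{U}_y(A), \tag{B.1} \]

where $\mathbb{U}_x(A)$ is the new closed a-event. If necessary, add $\varnothing$ and $\mathbb{U}$ to
$\mathbb{S}'$ to satisfy the requirement in Theorem 1'. This collection $\mathbb{S}'$ will serve as the
support for the new predictive random set $\mathcal{S}'$. First, we can see that $\mathbb{S}'$ is nested:
if $b(y) \ge b(x)$, then $S'_y \supseteq S'_x$. Second, since the new a-events are closed, each
$S'_x$ in (B.1) is closed and, hence, $\mathsf{P}_U$-measurable. Third, define the measure
$\mathsf{P}_{\mathcal{S}'}$ for $\mathcal{S}'$ to satisfy

\[ \mathsf{P}_{\mathcal{S}'}\{\mathcal{S}' \subseteq K\} = \sup_{x:\, S'_x \subseteq K} \mathsf{P}_U(S'_x). \]

According to Theorem 1', the new $\mathcal{S}'$ is valid. Moreover, by the lemma above and the
definition of $S'_x$, we have

\[ \mathsf{P}_U(S'_x) \ge \inf_{y \in \mathbb{X}:\, b(y) \ge b(x)} b(y) = b(x) = \mathsf{bel}_x(A; \mathcal{S}). \tag{B.2} \]

Finally, we have a comparison of the belief functions corresponding to $\mathcal{S}$ and $\mathcal{S}'$:

\[ \mathsf{bel}_x(A; \mathcal{S}') = \mathsf{P}_{\mathcal{S}'}\{\mathcal{S}' \subseteq \mathbb{U}_x(A)\} \ge \mathsf{P}_{\mathcal{S}'}\{\mathcal{S}' \subseteq S'_x\} = \mathsf{P}_U(S'_x) \ge \mathsf{bel}_x(A; \mathcal{S}); \]

the first inequality follows from monotonicity of $\mathsf{P}_{\mathcal{S}'}\{\mathcal{S}' \subseteq \cdot\}$
and the fact that $S'_x \subseteq \mathbb{U}_x(A)$ for each $x$, and the second inequality follows
from (B.2). □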






